INTRODUCTION


Early computer vision research was mainly concerned with operations
on pictures, such as encoding, enhancement, and edge detection [Rosenfeld,
1969a], and with analysis of single images, for example, interpreting images
containing bodies from a known set of objects [Roberts, 1963; Guzman, 1968].

Early matching work fell into the domain of pattern
recognition: matching a description of an idealized object against
descriptions generated from analysis of an image containing that object.
Some pixel-by-pixel matching was done in matching a template of an
alphanumeric character against pictures of hand-printed characters.
(Rosenfeld [1969b and 1973] provides excellent surveys of the literature for
single image processing.)

Stereo vision was for a long time the domain of psychologists and
physiologists, whose interests were in understanding human stereo vision
[Julesz, 1961]. The major use of stereo was in photogrammetry: converting
aerial photographs to contour maps, usually by optical methods [Bouchard and
Moffitt, 1965].

Eventually, computer stereo image processing became attractive.
Julesz [1963] saw it as a method of studying human stereo perception.
Computer photogrammetry techniques were developed and used to deal with
telemetered image data from satellites. Image processors began to use stereo
to determine depth information [Quam et al., 1972].

All of these applications required efficient ways of matching areas
of one picture with the corresponding areas of another, similar picture.
Quam [1971] developed a spiraling, stepping algorithm to facilitate his
aligning of Mariner spacecraft images for variable feature detection. Barnea
and Silverman [1972] reported a sequential decision algorithm which they used
in matching weather satellite photos. Other general investigations of
matching have been done by Fischler [1971] and by Fischler and Elschlager
[1971].


STATEMENT OF THE PROBLEM

What is matching? By matching, we mean the process of finding, for a
given sub-area (window) of an image X, the sub-area of image Y which contains
point for point the same intensity information. Matching should not be
confused with mapping. Mapping implies that there is some general function
F, with $(I_y, J_y) = F(I_x, J_x)$, which gives the position of corresponding
points in image Y for a given set of points in image X. Matching is a special
case of mapping: the case in which the mapping function is a simple
translation of axes, $(I_y, J_y) = (I_x, J_x) + (T_i, T_j)$, within the area
being matched.

This thesis is concerned with matching, not mapping. Therefore we
are limited to those areas of pairs of images which do not have large
perspective changes from one view to the other. This condition is met by
small angle stereo and by distant objects in larger angle stereo pairs. We
must also exclude areas of images representing objects which themselves move
or are moved so as to present differing projections to the cameras.
Similarly, we will also limit ourselves to high-quality pictures, i.e., those
without scratches or other blemishes on the negatives, those having low
noise, etc. These limitations assure that our target areas will have
matching candidate areas.


The subject of this thesis is as follows: given two images of a
scene constrained as above, use the information in the pictures to match
target area A of image X with its corresponding candidate area in image Y.
We will discuss general techniques for matching, efficient methods by which
matching can be done, some of the problems that can occur when matching real
data, and ways of extending matching areas. In addition, we will describe
some of the algorithms which have been implemented to use these techniques.


DESIGN  OF  THE  INVESTIGATION 


Picture  processing  is,  for  the  most  part,  an  applied  science.  It 
seeks  to  show  that  something  is  possible,  not  by  formally  proving  that  it  can 
be  done,  but  by  doing  it.  In  keeping  with  this  spirit,  this  dissertation 
will  contain  no  formal  proofs  of  existence,  termination,  correctness  or 
running time. It will contain discussions of techniques and algorithms and
reports on how well these techniques work when implemented.

There is, underlying all techniques presented in this thesis, a very
basic philosophy. Machine vision will in the near future be used for those
tasks which man can do but doesn't want to, such as assembly line drudgery,
or those tasks which he wants to do but can't, such as exploring inhospitable
planets. In the first case, the structure of the task environment is well
known and can be used to make the performance of the task more efficient.
This is the problem addressed and approach used by the Hand-Eye group at the
Stanford Artificial Intelligence Project [Feldman, 1963]. In the second
case the structure of the environment is only crudely known, hence can only
loosely be used to expedite the task.


It is this latter variety of problem for which the techniques of this
thesis were to be designed. Consequently, we will avoid whenever possible
overspecialization through the use of particular assumed structure or
semantics in the completion of our tasks. Our techniques may not be as
powerful as those using such information, but they will be more general.

Most of the techniques described in this thesis have been programmed;
those which have not will be so noted. The photographic illustrations in
this thesis are derived from visual output generated by these programs on a
television monitor. No photographic trickery has been done; what the reader
sees is roughly what a person operating that program would see on his
monitor.


DEFINITIONS 


Some of the terms from the field of computer vision which are used in
this thesis are defined below.

Picture -- a two-dimensional array of integer values which represent the light
intensities of a scene at some set of grid points.

Point -- one of the array elements of a picture.

Pixel -- (contraction of picture element) a point in a picture.

Color picture -- a set of three pictures, representing the red, green, and blue
filter components of a color photograph or a color television
picture.

Image -- the set of pictures representing a photograph: one picture for a
black-and-white photograph or three pictures for a color photograph.


CONVENTIONS  OF  PICTURE  PROCESSING 


In keeping with the conventions used in the television industry,
pixels are identified by their (I,J) positions with respect to the upper
left-hand corner of the picture, which has position (0,0). The I-dimension
increases to the right; the J-dimension increases downward.

The intensities at each pixel are represented by numbers from 0
through $k = 2^n - 1$, with 0 representing no light, or black, and k
representing full light, or white. Pixels are stored packed, as many as will
fit per word of computer memory.


NOTATIONAL CONVENTIONS


As  in  normal  programming  usage,  the  following  compromises  with 
standard  mathematical  notation  have  been  made. 

Square root signs are replaced by the function SQRT.

The raised dot for multiplication is replaced by *.

The  following  mathematical  conventions  are  used. 

Summation  signs  are  indicated  by  a  sigma.  The  variable  which  is  being  summed 
over  is  written  below  the  sigma.  When  exact  ranges  for  the  summation 
are  to  be  given,  they  are  given  as  a  boolean  expression  in  the  place 
of  the  summation  variable.  The  function  being  summed  is  written  to 
the  right  of  the  sigma.  Parentheses  are  used  only  when  necessary  to 
avoid  confusion. 

Examples: $\sum_i X_i$ and $\sum_{a \le i < b} X_i$

The mean of a variable is indicated by overbar notation.

$$\bar{X} = \frac{\sum_i X_i}{\sum_i 1}$$


OTHER  CONVENTIONS 


Illustrations are numbered with Arabic numerals within chapters and
are prefaced by the chapter number, e.g. the first illustration in Chapter 3
is Illustration 3-1. All illustrations for a given chapter appear together
at the end of the chapter. Prints of the original data appear in Appendix A.

Equations are numbered with lower case Roman letters within chapters
and are prefaced by the chapter number, e.g. the first equation in Chapter 2
is Equation 2-a.


Chapter  2 


BASIC  AREA  MATCHING  TOOLS  AND  TECHNIQUES 


Suppose one has been given two digitized photographs which were taken
of the same scene, but which differ in some respect, such as point of view.
Consider the problem of using a computer to determine which area of picture Y
(candidate area) best matches a given area of picture X (target area).

Geometrically,  two  areas  match  if  they  both  are  projections  of  the 
same  three-dimensional  piece  of  scene.  Intuitively,  two  areas  match  if  they 
"look  the  same".  Computationally,  two  areas  match  if  a  calculated  measure  of 
match  between  them  is  sufficiently  optimal. 


CORRELATION 


Since we are dealing with the probability of a match occurring, some
statistical measure is desirable as the measure of match. The common measure
for this is discrete correlation,

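$$\mathrm{COR} = \sum_i X_i * Y_i$$

which can be normalized by the means of the samples

$$\mathrm{COR} = \sum_i ( X_i - \bar{X} ) * ( Y_i - \bar{Y} )$$

or by the second moments of the samples

$$\mathrm{COR} = \frac{\sum_i X_i * Y_i}{\mathrm{SQRT}\left( \sum_i X_i^2 * \sum_i Y_i^2 \right)}$$

or by both.

$$\mathrm{COR} = \frac{\sum_i ( X_i - \bar{X} ) * ( Y_i - \bar{Y} )}{\mathrm{SQRT}\left( \sum_i ( X_i - \bar{X} )^2 * \sum_i ( Y_i - \bar{Y} )^2 \right)}$$
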

The last is the nicest to work with, since it has an absolute value
less than or equal to one, and its absolute value equals one if and only if
$X_i = a*Y_i + b$ for all i.
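
As an illustration of the last form, the following is a minimal Python
sketch, not the implementation used in this thesis. The function name
`normalized_correlation` and the sample data are assumptions made for the
example; areas are represented as flat lists of intensities. It shows that a
candidate differing from the target only by a gain and an offset still
correlates at (numerically) one.

```python
import math

def normalized_correlation(x, y):
    """Correlation normalized by both the means and the second moments."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of products of deviations from the sample means.
    t_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    t_xx = sum((a - mean_x) ** 2 for a in x)
    t_yy = sum((b - mean_y) ** 2 for b in y)
    if t_xx == 0 or t_yy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return t_xy / math.sqrt(t_xx * t_yy)

# Hypothetical data: the candidate is the target with gain 2 and offset 5.
target = [10, 12, 15, 11, 9, 14]
candidate = [2 * v + 5 for v in target]
print(normalized_correlation(target, candidate))
```

A negative gain gives a value of minus one, which is why the statement above
is about the absolute value of the measure.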


DIFFERENCE MEASURES


Also used are measures based on the difference between the samples
over the two areas, such as root-mean-square error,

$$\mathrm{RMS} = \mathrm{SQRT}\left( 1/n \sum_i ( X_i - Y_i )^2 \right)$$

which can also be normalized by the means of the samples.

$$\mathrm{RMS} = \mathrm{SQRT}\left( 1/n \sum_i ( ( X_i - \bar{X} ) - ( Y_i - \bar{Y} ) )^2 \right) \qquad \text{(2-a)}$$

Absolute difference is also used.

$$\mathrm{AD} = \sum_i | X_i - Y_i | / n$$

It too can be normalized by the means.

$$\mathrm{AD} = \sum_i | ( X_i - \bar{X} ) - ( Y_i - \bar{Y} ) | / n$$


The calculation of normalized absolute difference, however, requires
two passes over the data: one to calculate the sample means and one to sum
the absolute differences which include these sample means. All other
measures mentioned here, including normalized correlation and normalized RMS,
can be calculated in one pass over the data. What distinguishes normalized
absolute difference from the rest is the presence of summations both inside
and outside of the absolute value sign. Absolute value is not a linear
operator, and therefore effectively foils the algebraic manipulations of
summations which permit the other normalized measures to be calculated in one
pass. Because of the added inconvenience of a second pass over the data,
normalized absolute difference is rarely used.


Both RMS and absolute difference yield values between zero and a
number bounded by the largest difference between the samples, which is in
turn bounded by the maximum possible intensity at a pixel.
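
The one-pass property of normalized RMS can be made concrete with a short
Python sketch. It is an illustration under assumptions, not the thesis
implementation: the function names are invented here, and the one-pass form
relies on the identity that the sum of squared mean-corrected differences
equals the sum of squared differences minus the square of their sum divided
by n.

```python
import math

def normalized_rms_two_pass(x, y):
    """Direct use of the definition: means first, then the squared differences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = sum(((a - mean_x) - (b - mean_y)) ** 2 for a, b in zip(x, y))
    return math.sqrt(total / n)

def normalized_rms_one_pass(x, y):
    """Same value from running sums of d = X_i - Y_i and d*d, in one pass."""
    n = 0
    sum_d = 0
    sum_d2 = 0
    for a, b in zip(x, y):
        d = a - b
        n += 1
        sum_d += d
        sum_d2 += d * d
    # (1/n) * sum (d - mean_d)^2  ==  sum_d2/n - (sum_d/n)^2
    variance = sum_d2 / n - (sum_d / n) ** 2
    return math.sqrt(max(variance, 0.0))  # guard against rounding below zero

x = [10, 12, 15, 11, 9, 14]
y = [13, 12, 19, 10, 9, 20]
print(normalized_rms_two_pass(x, y), normalized_rms_one_pass(x, y))
```

No comparable rearrangement exists for normalized absolute difference,
because the absolute value cannot be expanded into separate sums.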


COMPARISON OF THE MEASURES OF MATCH


Perhaps  at  this  point  a  few  words  should  be  said  about  the  relative 
merits  of  correlation,  RMS,  and  absolute  difference  as  measures  of  match. 


RMS and absolute difference are clearly related. There is also a
relationship between normalized RMS and normalized correlation. In the
following, let

$$T(X,Y) = \sum_i ( X_i - \bar{X} ) * ( Y_i - \bar{Y} )$$

Correlation can now be expressed as

$$\mathrm{COR} = \frac{T(X,Y)}{\mathrm{SQRT}\left( T(X,X) * T(Y,Y) \right)}$$

Equation 2-a expands to

$$\mathrm{RMS} = \mathrm{SQRT}\left( 1/n \sum_i \left( ( X_i - \bar{X} )^2 - 2 * ( X_i - \bar{X} ) * ( Y_i - \bar{Y} ) + ( Y_i - \bar{Y} )^2 \right) \right)$$

$$= \mathrm{SQRT}\left( 1/n \left( T(X,X) - 2 * T(X,Y) + T(Y,Y) \right) \right).$$

Hence, we have

$$\mathrm{COR} = \frac{T(X,X) + T(Y,Y) - n * \mathrm{RMS}^2}{2 * \mathrm{SQRT}\left( T(X,X) * T(Y,Y) \right)}.$$


Being related, correlation, RMS, and absolute difference might be expected to
give similar results when used as the measure of match.
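
The relationship between normalized correlation and normalized RMS can be
checked numerically. The Python sketch below is only an illustration; the
helper names `t` and `normalized_rms` and the sample values are assumptions
introduced here. It computes correlation once from its definition and once
from the normalized RMS value, and the two results agree to rounding error.

```python
import math

def t(x, y):
    """T(X,Y): sum of products of deviations from the sample means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))

def normalized_rms(x, y):
    """Normalized RMS computed directly from its definition (Equation 2-a)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    total = sum(((a - mean_x) - (b - mean_y)) ** 2 for a, b in zip(x, y))
    return math.sqrt(total / n)

# Hypothetical samples.
x = [3, 8, 5, 12, 7]
y = [4, 9, 9, 10, 6]
n = len(x)

scale = math.sqrt(t(x, x) * t(y, y))
cor_direct = t(x, y) / scale
cor_from_rms = (t(x, x) + t(y, y) - n * normalized_rms(x, y) ** 2) / (2 * scale)
print(cor_direct, cor_from_rms)
```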

The cheapest measure of match, in terms of the number of instructions
required to implement it, is absolute difference. Two samples which match
exactly have an absolute difference of zero. It may be the case, however,
that the pixel intensities in the candidate area equal those in the target
area plus a constant (offset), that is, $Y_i = X_i + b$. In this case, the
absolute difference between the two intuitively matching areas would be
non-zero, perhaps greater than the absolute difference for some other area
which is similar, but not intuitively the matching area.

Normalized  RMS  takes  care  of  this  problem  by  subtracting  the  means  of 
the  two  areas  from  each  of  the  intensity  values  within  the  samples.  It 
trades  a  little  more  time  in  the  calculation  of  the  measure  of  match  for  more 
flexibility  in  its  application. 

Suppose, however, that the pixel intensities from the matching area
are equal to a constant factor (gain) times the intensities from the target
area, plus some constant offset, that is, $Y_i = a*X_i + b$. The value of RMS
over matching areas in this case is non-zero. This can result in rejection
of a matching area should some non-matching but relatively similar area
contain data which has a relative gain of one.

Normalized correlation, although more expensive, is designed to
handle both a constant gain and a constant offset. Subtracting the means
removes the problem of the offset; dividing by the variances takes care of
the gain. This can lead to multiple match candidates if several areas of
different relative gains and offsets resemble each other. However, this
merely introduces impostors; it does not discard true matches.


Because relative gain and offset are frequently present in digital
stereo images, the author prefers normalized cross-correlation to the other
measures of match, and has developed matching techniques centered around
correlation. However, if gain and offset are not a problem, or are known and
can be taken into account in the calculation of the difference measures, then
the techniques presented in this thesis can be adapted to normalized RMS or
absolute difference. Since the techniques of this thesis were developed and
originally implemented with correlation, they are discussed in terms of
correlation.


FAST  FOURIER  TRANSFORMS  FOR  CONVOLUTION 

Fast Fourier convolution is often mentioned as a tool for matching.
It is a method for calculating the $\sum XY$ term used in correlation and RMS
error somewhat more efficiently.

This $\sum XY$ term is the discrete convolution of the two samples; the
Fourier transform of this convolution is equivalent to the product of the
Fourier transforms of the two samples. Thus it is possible to do the
summation by taking the transforms of the two samples, multiplying them, then
taking the inverse Fourier transform of the result. If this is done for a
target area out of picture X and all of picture Y, the result is an array,
each element of which contains the value of the convolution between the
candidate area centered at that point and the target area.
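
The transform-multiply-invert procedure can be sketched in one dimension
with only the Python standard library. This is an illustration under
assumptions, not the Stanford implementation: the recursive radix-2 `fft`,
the function names, and the sample data are introduced here, and rounding
the result assumes integer intensities. The window's transform is
conjugated, which is equivalent to reversing the window, so that the product
yields the sliding $\sum XY$ sums rather than a convolution with a flipped
window. Both inputs are zero-padded to a power of two no smaller than the
sum of their lengths to avoid wrap-around.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two (unscaled)."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def sum_xy_direct(window, picture):
    """Sum of products for every placement of the window along the picture."""
    w = len(window)
    return [sum(window[i] * picture[s + i] for i in range(w))
            for s in range(len(picture) - w + 1)]

def sum_xy_fft(window, picture):
    """Same sums via transforms, padded so that no aliasing occurs."""
    w, n = len(window), len(picture)
    m = 1
    while m < n + w:
        m *= 2
    fw = fft([complex(v) for v in window] + [0j] * (m - w))
    fp = fft([complex(v) for v in picture] + [0j] * (m - n))
    product = [a.conjugate() * b for a, b in zip(fw, fp)]
    raw = fft(product, inverse=True)
    # Divide by m to complete the inverse transform; round for integer data.
    return [round((raw[s] / m).real) for s in range(n - w + 1)]

window = [1, 2, 3]
picture = [4, 0, 2, 5, 1, 3, 7, 2]
print(sum_xy_direct(window, picture))
print(sum_xy_fft(window, picture))
```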

With the fast Fourier transform, it is possible to do a transform of
a sample of $m = 2^n$ points in time proportional to $m \log_2 m$ [Singleton,
1967]. Let N be the maximum dimension of picture X and W be that of the
window being matched. Due to the aliasing problem, it is necessary that m be
not less than N+W [Cooley, et al., 1967], as well as being a power of two.
If we let L be the constant necessary to bring N+W up to $2^n$, then the
$\sum XY$ for $N^2$ correlations can be done in time proportional to
$(N+W+L)^2 \log_2 (N+W+L)^2$ by the FFT method, as compared to time
proportional to $N^2 W^2$ for the direct computation. Of course, for
normalized correlation or normalized RMS, it is still necessary to compute
$\sum X$, $\sum X^2$, $\sum Y$, and $\sum Y^2$ directly, and to combine
them in order to calculate each of the measures of match. Employing sliding
sums to calculate these terms adds time proportional to $N^2$; time
proportional to $N^2$ is also added to combine the sums for calculating the
measures of match.
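
The sliding sums mentioned above can be sketched as follows. This is a
one-dimensional Python illustration under assumptions (the function name
`sliding_sums` and the data are invented here); the same update applied
along rows and then along columns would give the two-dimensional case. Each
new window sum is obtained from the previous one by adding the entering
value and subtracting the leaving value, so the cost per window does not
depend on the window size.

```python
def sliding_sums(values, w):
    """Return (sums, sums_of_squares) for every length-w window of values."""
    total = sum(values[:w])
    total_sq = sum(v * v for v in values[:w])
    sums = [total]
    squares = [total_sq]
    for i in range(w, len(values)):
        leaving = values[i - w]
        entering = values[i]
        # Update both running sums in constant time per step.
        total += entering - leaving
        total_sq += entering * entering - leaving * leaving
        sums.append(total)
        squares.append(total_sq)
    return sums, squares

sums, squares = sliding_sums([4, 0, 2, 5, 1], 3)
print(sums, squares)
```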


Which method is faster for a given problem will depend on the values
of N and W and the constants of proportionality, which depend on the
implementations. Illustration 2-1 compares the FFT approach with the direct
approach for the implementations used at Stanford A.I. and several values of
N and W.
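
The two cost formulas tabulated in Illustration 2-1 can be evaluated with a
few lines of Python. The constants 0.000026 and 0.000080 are the ones quoted
with that illustration; the function names are assumptions of this sketch,
and the FFT estimate, like the table, neglects the complex multiplies.

```python
import math

DIRECT_CONSTANT = 0.000026  # seconds per multiply-add, from Illustration 2-1
FFT_CONSTANT = 0.000080     # seconds per FFT unit, from Illustration 2-1

def direct_seconds(n, w):
    """Estimated time for the direct method: constant * N^2 * W^2."""
    return DIRECT_CONSTANT * n ** 2 * w ** 2

def fft_seconds(n, w):
    """Estimated time for two FFTs on an m-by-m array, m = N+W+L a power of two."""
    m = 1
    while m < n + w:
        m *= 2
    return FFT_CONSTANT * 4 * m ** 2 * math.log2(m)

for n in (100, 200, 500):
    for w in (11, 21):
        faster = "direct" if direct_seconds(n, w) < fft_seconds(n, w) else "FFT"
        print(n, w, round(direct_seconds(n, w), 4), round(fft_seconds(n, w), 4), faster)
```

Because m jumps to the next power of two, the FFT estimate is constant over
wide ranges of N and W and then quadruples, whereas the direct estimate grows
smoothly.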


AREA SAMPLING


Considerable time is wasted in calculating the measure of match over
all the pixels in every candidate area in the second picture. Like most
searches, the search for a match spends most of its time failing: calculating
the measure of match for areas that don't match. If one can reduce the
amount of time spent failing, a significant saving will result.

Barnea and Silverman [1972] observed that, for most candidate areas,
it becomes obvious after a small fraction of the points in the area have been
processed that the measure of match is going to have a non-optimal value. If
processing of that area is aborted when the area's non-optimality is
discovered, a considerable savings of time results.

Toward  this  end,  they  propose  the  following  sequential  decision 
algorithm.  Start  calculating  the  measure  of  match,  taking  corresponding 
pairs  of  sample  elements  out  of  the  two  areas  in  pseudo-random  order.  At 
intervals,  monitor  the  value  of  the  measure  of  match.  If  at  any  time  the 
measure  is  non-optimal  according  to  their  decision  criteria,  discontinue  the 
calculations  and  discard  the  area  as  non-matching.  Otherwise,  continue 
adding  in  samples  randomly  until  either  the  whole  area  has  been  included  or 
the  measure  becomes  non-optimal. 
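
The control flow of such a sequential decision procedure can be sketched in
Python. This is not Barnea and Silverman's code: the text above does not
give their decision criteria, so a single fixed error threshold is assumed
here, together with absolute difference as the running measure, and the
function name and parameters are hypothetical. Areas are flat lists of
intensities.

```python
import random

def sequential_absolute_difference(target, candidate, threshold,
                                   check_every=4, seed=0):
    """Accumulate |X_i - Y_i| in pseudo-random order, stopping early.

    Returns (accumulated_error, points_used, rejected).  The fixed
    threshold is an assumption of this sketch, not the published criterion.
    """
    order = list(range(len(target)))
    random.Random(seed).shuffle(order)  # pseudo-random visiting order
    error = 0
    for used, index in enumerate(order, start=1):
        error += abs(target[index] - candidate[index])
        # Monitor the measure at intervals and abandon hopeless candidates.
        if used % check_every == 0 and error > threshold:
            return error, used, True
    return error, len(order), False

target = list(range(16))
good = list(target)
bad = [v + 50 for v in target]
print(sequential_absolute_difference(target, good, threshold=100))
print(sequential_absolute_difference(target, bad, threshold=100))
```

In this example the poor candidate is discarded after four of its sixteen
points, while the matching candidate is processed in full.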

Barnea and Silverman claim that this algorithm is up to 50 times
faster than matching by ordinary correlation techniques. Unfortunately, they
do not separate the savings due to their using absolute difference as the
measure of match from the savings due to the algorithm itself. Quam
[unpublished research, 1973] finds, in one particular application, that their
algorithm used with normalized RMS is five to ten times as fast as ordinary
normalized correlation techniques.

Reducing  the  number  of  points  handled  in  some  of  the  sample  areas  is 
one  side  of  the  coin.  The  other  side  represents  the  possibility  of  not 
calculating  that  measure  of  match  for  every  candidate  area  in  the  second 
picture. 


Direct method.  Tabulated values are 0.000026 $N^2 W^2$ seconds.

  N \ W          11          13          15          17          19          21
    100     31.4600     43.9400     58.5000     75.1400     93.8600    114.6600
    150     70.7850     98.8650    131.6250    169.0650    211.1850    257.9850
    200    125.8400    175.7600    234.0000    300.5600    375.4400    458.6400
    250    196.6250    274.6250    365.6250    469.6250    586.6250    716.6250
    300    283.1400    395.4600    526.5000    676.2600    844.7400   1031.9400
    350    385.3850    538.2650    716.6250    920.4650   1149.7850   1404.5850
    400    503.3600    703.0400    936.0000   1202.2400   1501.7600   1834.5600
    450    637.0650    889.7850   1184.6250   1521.5850   1900.6650   2321.8650
    500    786.5000   1098.5000   1462.5000   1878.5000   2346.5000   2866.5000

FFT method.  Tabulated values are 0.000080 * 4 $(N+W+L)^2 \log_2 (N+W+L)$
seconds, i.e. 2 FFT's: one for the window, one to inverse FFT the product of
the FFT of the window and the FFT of Picture Y.  This neglects the time
needed for $N^2$ complex multiplies to form the product.

  N \ W          11          13          15          17          19          21
    100     36.7002     36.7002     36.7002     36.7002     36.7002     36.7002
    150    167.7722    167.7722    167.7722    167.7722    167.7722    167.7722
    200    167.7722    167.7722    167.7722    167.7722    167.7722    167.7722
    250    754.9747    754.9747    754.9747    754.9747    754.9747    754.9747
    300    754.9747    754.9747    754.9747    754.9747    754.9747    754.9747
    350    754.9747    754.9747    754.9747    754.9747    754.9747    754.9747
    400    754.9747    754.9747    754.9747    754.9747    754.9747    754.9747
    450    754.9747    754.9747    754.9747    754.9747    754.9747    754.9747
    500    754.9747   3355.4432   3355.4432   3355.4432   3355.4432   3355.4432

Illustration 2-1.  These tables compare the relative efficiencies of the
direct method and FFT for calculating the convolution $\sum XY$ of a window
of $W^2$ points with a picture of $N^2$ points.  The constants of
proportionality are derived from machine language codings on the PDP-10 at
Stanford A.I. by Lynn Quam (direct method) and Don Oestereicher (FFT).


Chapter 3


SEARCH  STRATEGIES  AND  REFINEMENTS 


The idea of shortening a search by "pruning" the search space is not
a new one. Heuristic search has been a part of artificial intelligence from
the beginning [Nilsson, 1972]. The basic idea is simple: arrange the search
in such a way that entire sets of solutions are considered at once. Attach
to each set some way of measuring whether or not it has a good chance of
containing the desired solution. Work in detail on only those sets which
show promise. Whenever possible, work first on those sets which show most
promise.


With most pictorial data, there is a fair amount of local coherence.
By this, we mean that an area centered at one pixel does not usually differ
greatly from an area centered at a neighboring pixel. An alternate
expression of this would be to say that most pictorial data consists
primarily of low frequency information. This makes it possible to use one
candidate area as a representative of a set of areas centered at adjacent
points. The evaluation of some computationally inexpensive measure of
agreement over the representative area serves as the evaluation of the set.
A number of variations on this technique can be used in pruning the search
for a match.


GRIDDING


Consider for a moment the surface formed by plotting correlation as a
function of position of candidate area centers in the vicinity of the
matching candidate area. Because of the local coherence of most target
areas, this correlation surface usually falls off gradually as one moves away
from the matching area's center. (See Illustration 3-1). Therefore, in the
immediate vicinity of the peak in a correlation surface which indicates a
match, the correlations will usually be above some threshold.

One can take advantage of this fact by calculating the correlations
between the target area and candidate areas centered at points on an n by n
grid over picture Y. For suitable n and threshold, it is clear that one of
the grid candidates must lie somewhere on the match peak above the
significance threshold. By searching in detail the immediate vicinity of any
grid candidate showing a significant correlation, one can locate the match
peak. This is the technique of gridding.

If one uses an n by n grid on an N by N picture and finds k
correlations above the significance threshold Q, one calculates about
$(N/n)^2 + k*n^2$ correlations of area $W^2$ in finding the match. In
comparison, the direct method requires $N^2$ correlations to locate the
match. Since in most cases k is small, gridding results in a savings of
about a factor of $n^2$ over the direct method.
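
A small Python sketch of gridding follows. It is an illustration under
assumptions rather than the implementation described in this thesis: the
function names, the use of an n by n detailed search around each significant
grid candidate, and the synthetic paraboloid test picture are all choices
made for the example. Pictures are lists of rows, indexed as picture[J][I].

```python
import math

def window(picture, ci, cj, r):
    """Flatten the (2r+1) by (2r+1) window centred at column ci, row cj."""
    return [picture[j][i]
            for j in range(cj - r, cj + r + 1)
            for i in range(ci - r, ci + r + 1)]

def ncc(x, y):
    """Normalized correlation of two flattened areas (0.0 if either is flat)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    txy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    txx = sum((a - mx) ** 2 for a in x)
    tyy = sum((b - my) ** 2 for b in y)
    return txy / math.sqrt(txx * tyy) if txx and tyy else 0.0

def grid_search(target, picture, r, n, q):
    """Correlate on an n by n grid, then search around significant points."""
    height, width = len(picture), len(picture[0])
    evaluated = 0
    best = (-2.0, None)
    for gj in range(r, height - r, n):
        for gi in range(r, width - r, n):
            evaluated += 1
            if ncc(target, window(picture, gi, gj, r)) < q:
                continue  # not on a significant peak; prune this neighbourhood
            for j in range(max(r, gj - n // 2), min(height - r, gj - n // 2 + n)):
                for i in range(max(r, gi - n // 2), min(width - r, gi - n // 2 + n)):
                    evaluated += 1
                    score = ncc(target, window(picture, i, j, r))
                    best = max(best, (score, (i, j)))
    return best, evaluated

# Synthetic, locally coherent picture; the target is cut from position (7, 6).
picture = [[(i - 10) ** 2 + (j - 8) ** 2 for i in range(20)] for j in range(20)]
target = window(picture, 7, 6, 2)
best, evaluated = grid_search(target, picture, r=2, n=3, q=0.9)
print(best, evaluated, "of", (20 - 4) ** 2, "possible centres")
```

On this picture the search recovers the true position while evaluating far
fewer correlations than the exhaustive search over every window centre.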

The success of gridding, of course, lies in the choice of n and of Q,
which influences k. Examining the first pair of correlation cross sections
in Illustration 3-1, we see that for Q=.5, n must be 1, but if Q=.1, n can be
5. For the second pair, Q=.5 means n=6, and Q=.1 means n=10. The allowable
values for n and Q are not only interconnected, but also depend on the
individual correlation peak.

However, when we begin our search for a match and are ready to set n
and Q, we do not yet know what the correlation peak will look like. We do
know that, under ideal conditions for matching, the target area will very
closely resemble its match. If the matching area exactly duplicated the
target area, then the correlation surface would be identical to the
autocorrelation surface. (See Appendix B.) In practice, this precise
equivalence does not hold; however, the correlation and autocorrelation
surfaces do resemble each other (See Illustration 3-2). Hence the
autocorrelation peak can give a good indication of the proper n and Q for a
given target area. Extracting this information can be done by inspection, by
fitting a second order surface to the correlation peak and measuring its
parameters, or by examining the Fourier transform of the data.

In theory, gridding will always work, since the worst it can do is
degenerate to the standard method of evaluating the correlation at every
point when n=1. In practice, however, gridding is not used if the
autocorrelation peak indicates a grid spacing of 1 or 2. Such an
autocorrelation peak can occur if the target area contains mainly high
frequency information, as is the case in the distant trees along the skyline
in the Lab pictures. (See the copies of the original data in Appendix A,
coordinates (112,20) in the first image and (113,26) in the second, window
radius=7.) It also occurs in extremely noisy images and is a feature of some
artificially generated images [Julesz, 1961 and 1963].


REDUCTION 


One technique for utilizing local coherence to make the amount of data to
be handled more manageable is reduction [see e.g. Kelly, 1970]. In our
application, this means making a new pair of pictures by spatially reducing
the originals, effectively replacing m by m squares of pixels by one pixel
having the average intensity of that square. Appropriate areas are then
matched in the reduced pictures. Finally the correlation peaks for the areas
found to match best in the smaller pictures are searched in the original,
higher resolution pictures.

Doing an m by m spatial reduction on the pictures means that there
are now $(N/m)^2$ potential areas for the reduced target area to match
instead of $N^2$, a savings of a factor of $m^2$ in the number of
correlations to be calculated. If the target area is to represent the same
objects, then its size is also reduced from $W^2$ to $(W/m)^2$ pixels. This
results in a savings of a factor of $m^2$ in the correlation calculation
loop, for an overall savings factor of $m^4$.
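
The reduction step itself is simple enough to sketch in Python. The function
name `reduce_picture` is an assumption of this example, as are two details
the text does not specify: averages are truncated to integers, and any rows
or columns left over when the picture size is not a multiple of m are
dropped.

```python
def reduce_picture(picture, m):
    """Replace each m by m square of pixels by one pixel holding its average."""
    height = len(picture) // m
    width = len(picture[0]) // m
    reduced = []
    for rj in range(height):
        row = []
        for ri in range(width):
            total = sum(picture[rj * m + dj][ri * m + di]
                        for dj in range(m)
                        for di in range(m))
            row.append(total // (m * m))  # integer intensity, rounded down
        reduced.append(row)
    return reduced

picture = [[1, 3, 5, 7],
           [1, 3, 5, 7],
           [2, 2, 6, 6],
           [4, 4, 8, 8]]
print(reduce_picture(picture, 2))
```

With m = 2, for example, the number of candidate positions and the number of
pixels per window each fall by a factor of four, giving the factor of
$m^4 = 16$ described above.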

If the target area is not very big to start with, reducing the images
may cause the target area to no longer represent a valid statistical sample.
If one is not constrained to matching any particular area, but can enlarge
the effective area to maintain a valid sample size in the reduced pictures,
then reduction can be used. The savings factor will depend on the exact size
of the window which must be used, but should be somewhere between $m^2$ and
$m^4$.

As with gridding, there is an additive term of $k*m^2$ full scale
correlations necessary to determine the location of the unreduced match.
Here, k depends on how many areas within the reduced second image will
resemble the reduced target area, which is difficult to predict. There is
also the overhead of reducing the two images, but this can often be combined
with some other necessary processing.

The success of reduction depends on the choice of m, which in turn
depends on the information within the picture. Intuitively, if most of the
information in the picture lies in features which are p pixels wide, then one
does not wish to reduce the picture by a factor of p or greater.
Computationally, if the Fourier transform of the picture reveals that a
significant part of the power is in spatial frequencies higher than N/p, one
should reduce the picture by a factor of less than p. In general, one should
avoid reduction by a factor sufficiently large to change the spatial
frequency or information content of the pictures. One way to check on this
is to examine the autocorrelation peak in both the original and reduced
pictures. If the peak is much narrower in the original than in the reduced
image, too much reduction has happened.

If one allows the choice of m to be determined by the data, then in
theory, reduction will always work, since it simply degenerates to the
standard method for m=1. If a larger than recommended reduction is
employed (for example, to decrease noise), then the possibility exists that
the technique of reduction will fail to produce the proper match.


SIMILARITY 


The technique of similarity differs from previously described
techniques in that it does not use correlation as the basis for pruning the
search for the match. The idea behind similarity is simple: if two areas
match, then statistical measures calculated over them, such as means and
variances, should be similar.

To employ the basic technique of similarity, one first calculates a
vector of statistics for the target area and for each of the candidate areas.
The most promising candidate areas are those which have vectors of statistics
similar to the vector for the target area, as determined by a weighted
distance metric. Then the correlation values between the target area and
those candidate areas are used to decide which promising candidate area is
the matching area.
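The pruning step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the experiments: the function names are hypothetical, the vector is limited to mean and variance of intensity, and a weighted Euclidean distance stands in for whatever weighted metric is actually chosen.

```python
from statistics import fmean, pvariance

def stat_vector(pixels):
    """Vector of statistics for one area: (mean, variance) of intensity."""
    return (fmean(pixels), pvariance(pixels))

def weighted_distance(u, v, weights):
    """Weighted Euclidean distance between two statistics vectors (an assumed metric)."""
    return sum(w * (a - b) ** 2 for w, a, b in zip(weights, u, v)) ** 0.5

def promising_candidates(target, candidates, weights, k):
    """Indices of the k candidate areas whose vectors lie nearest the target's."""
    t = stat_vector(target)
    ranked = sorted(
        range(len(candidates)),
        key=lambda i: weighted_distance(t, stat_vector(candidates[i]), weights),
    )
    return ranked[:k]

# Each area is given as a flat list of pixel intensities.
target = [10, 12, 11, 13]
candidates = [[50, 52, 51, 53], [10, 12, 11, 14], [0, 0, 0, 0], [11, 12, 11, 13]]
best = promising_candidates(target, candidates, weights=(1.0, 1.0), k=2)
```

Only the areas returned in `best` would then be correlated against the target to decide which one is the match.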

Comparing similarity to the standard method is not as simple as
comparing gridding or reduction. We can no longer just count the number of
correlations calculated, since most of the time involved in using similarity
is spent doing things other than correlating.

Calculating the statistics over N^2 areas in picture Y with sliding
sums will require time proportional to N^2. The constant of proportionality
will, of course, depend upon how many statistics are calculated and upon the
statistics themselves. For instance, on the PDP-10, it takes 0.525 ms per
pixel to calculate 5 statistics--mean and variance of intensity and vector
mean (2 components) and variance of color--for a color image. It takes 0.145
ms per pixel to calculate 2 statistics--mean and variance of intensity--for a
black-and-white image.
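To make the constant per-area cost concrete, the following sketch computes the mean and variance of every w-by-w window in time proportional to the number of pixels. It is an illustration only: it uses summed-area tables as a stand-in for the sliding sums mentioned above (both avoid re-adding the whole window at each position), and the function name and the top-left indexing of windows are assumptions.

```python
def window_stats(image, w):
    """Mean and variance of every w-by-w window, via summed-area tables.

    Returns a dict mapping each window's top-left (row, col) to (mean, variance).
    """
    rows, cols = len(image), len(image[0])
    # s1[r][c] and s2[r][c] hold the sums of x and x*x over image[0:r][0:c].
    s1 = [[0] * (cols + 1) for _ in range(rows + 1)]
    s2 = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            x = image[r][c]
            s1[r + 1][c + 1] = x + s1[r][c + 1] + s1[r + 1][c] - s1[r][c]
            s2[r + 1][c + 1] = x * x + s2[r][c + 1] + s2[r + 1][c] - s2[r][c]

    def box(s, r, c):
        # Sum over the window whose top-left corner is (r, c).
        return s[r + w][c + w] - s[r][c + w] - s[r + w][c] + s[r][c]

    n = w * w
    stats = {}
    for r in range(rows - w + 1):
        for c in range(cols - w + 1):
            mean = box(s1, r, c) / n
            stats[(r, c)] = (mean, box(s2, r, c) / n - mean * mean)
    return stats

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
stats = window_stats(image, 2)
```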

Comparing the r statistics in the target vector to the r statistics
in N^2 candidate vectors will require time proportional to r*N^2. For
example, it takes 0.175 ms to compute a weighted distance metric for 5
statistics and store the resulting distance; it takes 0.075 ms for 2
statistics. Sorting n distances to order the areas by how promising they are
requires 0.070*(log n)*(n + log n) ms. Finally, calculating the correlations
for the k most promising areas, using a window of area W^2, requires
0.065*k*W^2 ms. By comparison, it takes 0.065*N^2*W^2 ms to calculate all of
the correlations directly.


To better illustrate the comparison, consider matching a 21 by 21
area out of a 150 by 150 picture, let k=10, and use the 5 component vectors
from color images. For this example, it would require about 645 seconds to
compute the correlations necessary to determine the match directly;
similarity spends about 11.8 seconds calculating the vectors, 3.9 seconds
calculating the distances, 25.2 seconds sorting them, and 0.3 seconds
calculating the k correlations, for a total of about 41.2 seconds,
representing a savings factor of about 16.
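The arithmetic behind this example can be checked with a short script. It is a sketch built only from the timing constants quoted above; the function names are hypothetical. The base and rounding of the logarithm in the sorting term are not stated, so the log factor is a parameter, and the value 16 is an assumption chosen because it reproduces the quoted 25.2 seconds.

```python
def brute_force_similarity_ms(n, w, k, per_pixel=0.525, per_distance=0.175,
                              log_factor=16):
    """Time terms (ms) for similarity applied to all n*n candidate areas."""
    areas = n * n
    window = w * w
    return {
        "vectors": per_pixel * areas,
        "distances": per_distance * areas,
        # log_factor stands in for (log n); 16 is an assumed value.
        "sort": 0.070 * log_factor * (areas + log_factor),
        "correlations": 0.065 * k * window,
    }

def direct_ms(n, w):
    """Time (ms) to correlate the target against every candidate area."""
    return 0.065 * (n * n) * (w * w)

terms = brute_force_similarity_ms(150, 21, 10)
total_s = sum(terms.values()) / 1000.0
direct_s = direct_ms(150, 21) / 1000.0
savings = direct_s / total_s
```

With these inputs the sorting term dominates the similarity total, which is what motivates the refinements that follow.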

As savings factors go, 16 is neither trivial nor wonderful. So far,
however, we have implemented similarity in a brute force style comparable to
the direct method for finding the match. It is possible to refine similarity
in order to make it much more efficient.

We have pointed out before that most images have a local coherence
which causes area-based measures such as mean and variance to change slowly
as the area center is moved by one or two pixels. This means that we really
do not need to calculate the distances between the target area vector and
vectors for areas centered at every point in the second image. We can allow
an area centered at one point to represent those areas centered at adjacent
points and apply heuristic search methods.

For instance, one could sort only those vectors of statistics which
fall on an m by m grid, reducing the number of distances which must be
calculated and sorted to (N/m)^2. Then, from the most promising k such grid
points, one could hill-climb in the vector distance space until one found the
most promising vectors, which would be checked via correlation to determine
the match. Here we spend the same amount of time calculating the vectors,
but only 0.175*(N/m)^2 ms calculating distances,
0.070*(log2 (N/m)^2)*(log2 (N/m)^2 + (N/m)^2) ms sorting the distances,
0.175*k*m^2 ms calculating distances for the hill climb of promising vectors,
and 0.065*k*W^2 ms doing the correlations for these promising hill tops.
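A hill climb of this kind might look like the sketch below. It is an assumption-laden illustration rather than an implemented program: `distance` is a hypothetical callable giving the weighted distance between the target vector and the vector for the area centered at a point, and since the most promising vector is the one with the smallest distance, the climb moves toward smaller values.

```python
def hill_climb(distance, start, rows, cols):
    """Walk from a grid point to a local minimum of the vector distance.

    distance(r, c) is a hypothetical callable; smaller means more promising.
    """
    current = start
    best = distance(*current)
    while True:
        r, c = current
        neighbors = [(r + dr, c + dc)
                     for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                     if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols]
        candidate = min(neighbors, key=lambda p: distance(*p))
        value = distance(*candidate)
        if value >= best:
            # No neighbor is more promising: this is a hill top.
            return current, best
        current, best = candidate, value

# Toy distance surface with its minimum at (7, 3).
def surface(r, c):
    return (r - 7) ** 2 + (c - 3) ** 2

top, value = hill_climb(surface, start=(0, 0), rows=10, cols=10)
```

Each of the k hill tops found this way would then be checked by correlation.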

Suppose that we set N=150, W=21, k=10 as before and let m=10. As
before, we spend 11.8 seconds calculating the vectors, 0.04 seconds
calculating grid point distances, 0.13 seconds sorting these distances, 0.18
seconds calculating distances for the hill climbs, and 0.30 seconds doing the
correlations for these promising hill tops. This is an overhead of 11.8
seconds, plus 0.65 seconds per match. For only one match, this gives a
savings factor of slightly over 50. If the overhead is spread among 20
matches, the savings factor goes up to over 500!
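The effect of sharing the overhead among several matches is a one-line calculation. The sketch below uses the 645 seconds quoted earlier for the direct method together with the overhead and per-match figures of this paragraph; the function name is hypothetical.

```python
def savings_factor(direct_s, overhead_s, per_match_s, matches):
    """Ratio of direct-search time to similarity time per match,
    with the one-time overhead shared equally among the matches."""
    return direct_s / (overhead_s / matches + per_match_s)

one = savings_factor(645.0, 11.8, 0.65, matches=1)
twenty = savings_factor(645.0, 11.8, 0.65, matches=20)
```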

This technique has not been implemented, however, because of the
large amount of storage memory it requires. In addition to the 150^2 6-bit
intensity values of the second image (which amounts to 3,750 computer words)
that are needed for the brute force correlation method, this method also
requires 5*150^2 36-bit numbers to store the vectors for the second image.
This amounts to 112,500 additional words of computer memory, which on most
systems is hard to come by. Our speedup of a factor of 500 is accompanied by
a very large increase in the space required to do the job.

Now, instead of keeping all of our vectors of statistics from every
point, we only keep them for areas centered on an m by m grid over the second
picture. The most efficient way to do this for the general case is still
with sliding sums. Recording only every m-th vector in both directions means
that this now takes 0.065*N^2 + 0.080*(N/m)^2 ms for the black-and-white and
0.340*N^2 + 0.185*(N/m)^2 ms for the color vectors described earlier. This
time we calculate (N/m)^2 distances and sort them as before. For the best k
distances, we employ some form of local correlation search to cover that m by
m area, which potentially holds the match.

For N=150, W=21, k=10, and m=10 as before, we now spend 7.69 seconds
forming the vectors. For each target area, we spend 0.04 seconds calculating
the distances and 0.13 seconds sorting them. If we calculate all m^2
correlations for each of the k promising areas, we will spend 28.67 seconds
in the correlation loop. Employing gridding or some other form of efficient
correlation search can reduce this term significantly. Realistically, if we
share the overhead among 20 matches and do about 150 correlations in
searching the most promising areas (see Illustration 3-3), then matching one
target area will take around 4.85 seconds, a savings of a factor of
approximately 130 over the direct method. The extra space required is a mere
5*15^2 36-bit words, or 1,125 words, a reasonable amount.

Clearly, similarity is a very complicated technique whose relative
efficiency depends on a great number of things. The overhead depends heavily
on the number and type of statistics used, which will depend on the data and
the ingenuity of the experimenter in using it. An increased number of
complex statistics makes the overhead greater and increases the amount of
time spent calculating the distance measures. But, as Illustration 3-3
shows, having more statistics in the vector can reduce the number of areas
which look promising, hence the number of correlations which must be
calculated.

The type of statistics used can affect the success of similarity.
Averaging measures such as mean and variance have the advantage of being
quick and easy to calculate, fairly insensitive to noise, and, as noted
before, usually insensitive to small changes in position. In general,
statistics that average are preferred to those that count or those that
difference.

The calculation of the distances for the vectors and the sorting of
the vectors depend on the number of representative areas, hence on the
gridding over which the representative areas are taken. Too small a gridding
results in a large number of vectors to be compared and sorted; too large a
gridding may let the matching area go unrecognized because it fell between
two representative areas which didn't resemble it. As with the grid spacing
for correlation gridding, the best way to set this grid spacing is to examine
the vector surfaces for the neighborhood of the target area.

The technique of similarity usually works, but not always. If the
pictures are very homogeneous, all areas will be similar, resulting in many
candidate areas to be searched via correlation, hence little savings. If the
pictures have much fine detail or are noisy, then the candidate gridding may
be so fine that the technique loses its usefulness.

The presence of objects which have moved relative to their
backgrounds in the second image may cause the technique of similarity to fail
completely for some target areas. For instance, consider the pair of areas
in the barn pictures (see Appendix A for the originals) which are centered in
the trees to the left of the telephone pole, and have the pole itself in the
right half of the areas. These areas will match very well. However, if it
should happen that the representative area which has its center physically
closest to that of the matching area contains a part of the foreground post,
it will not be similar to the target area. Because that representative area
is not similar, it will not be searched and the match will be missed.

So far we have been discussing methods of reducing the search which
do not assume anything not directly contained in the picture data. This was
the case for our data; however, in general we will know somewhat more about
our pictures. A reasonable design constraint on a picture-taking system is
that it record how it was oriented when it took the pictures. This
information enables one to model the relative positions and orientations of
the cameras.


If complete camera model information is not known, as was the case
for us, it still is possible to derive a workable model from the pictures
themselves. Several things are known to be undecidable given just the
information in the pictures. Absolute position, for instance, is not
derivable; it requires external knowledge such as measurements made when the
picture was taken or recognition of some landmark in the picture. Likewise,
it is impossible to say exactly how large or how far away a given object is
without measurements or landmarks to establish scale.


It is possible, however, to derive relative positions and relative
sizes for objects in the pictures. This is done by assigning an arbitrary
position and orientation to one of the cameras and by fixing some distance,
such as the distance between the cameras. With these hypotheses and a
suitable number of point-pair matches derived by the previously mentioned
techniques, the relative orientations of the cameras and positions of objects
which appear in both pictures can be calculated.


Theoretically, if one has N unknowns in the camera model and N
constraints in the form of matching point pairs, one can obtain a closed form
solution for the camera model. In practice, the constraining equations do
not usually permit analytical solution. Therefore, a more common technique
is to approximate the unknowns by least-squares techniques, either in closed
form or by numerical methods. By either method, one needs at least N/2 point
pairs. The locations of these point pairs within the images and the locations
in 3-space of the points they represent are important. If these point pairs
are concentrated in one area of the image or if they represent 3-dimensional
points which are all coplanar, then N/2 point pairs is not sufficient. For
numerical least-squares approximation of the camera model parameters, the
author likes to have at least twice as many point pairs as there are
parameters to be derived, and to have these pairs well distributed in both
images.


Several different approaches have been taken to the problem of
deriving camera models from picture information. (See, e.g., [Sobel, 1970].)
Since this author was faced with pictures for which no camera model was given
and since no available model derivation code was applicable, yet another
camera model derivation method has been developed.

This author's approach is based on searching for the camera model
which minimizes a least-squares measure of camera model error. Each pair of
matching areas is first characterized as a pair of points, the centers of the
areas. For every proposed camera model in the search, each pair of points is
placed on the image planes of the cameras, and the rays from the principal
points of the cameras through these image plane points are calculated. The
error is a function of how close these pairs of rays come to each other in
3-space, normalized by the mean distance to the point of approach. (A
mathematical explanation of this measure appears in Appendix C.)
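The geometric idea can be sketched as follows. This is not the measure of Appendix C, which is not reproduced here; it is one plausible reading, under the assumption that the error for a pair of rays is the gap between them at closest approach divided by the mean distance from the two principal points to the points of closest approach. The function names are hypothetical, and the rays are assumed not to be parallel.

```python
def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def along(p, d, t):
    return [x + t * y for x, y in zip(p, d)]

def ray_pair_error(p1, d1, p2, d2):
    """Gap between two rays at closest approach, over the mean range to it.

    p1, p2 are the camera principal points; d1, d2 are the ray directions.
    Assumes the rays are not parallel.
    """
    r = sub(p1, p2)
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, r), dot(d2, r)
    denom = a * c - b * b
    t1 = (b * e - c * d) / denom   # parameter of the nearest point on ray 1
    t2 = (a * e - b * d) / denom   # parameter of the nearest point on ray 2
    q1, q2 = along(p1, d1, t1), along(p2, d2, t2)
    gap = dot(sub(q1, q2), sub(q1, q2)) ** 0.5
    range1 = abs(t1) * a ** 0.5
    range2 = abs(t2) * c ** 0.5
    return gap / ((range1 + range2) / 2.0)

# Rays that intersect give zero error; skew rays give a positive error.
exact = ray_pair_error((0, 0, 0), (0, 0, 1), (1, 0, 0), (-1, 0, 1))
skew = ray_pair_error((0, 0, 0), (0, 0, 1), (1, 1, 0), (-1, 0, 1))
```

A least-squares search would sum the squares of such errors over all point pairs for each proposed camera model.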

This author, not being a numerical analyst, implemented a very
unsophisticated function minimizer to search for the best camera model for a
given set of points. That program showed that the technique would work, but
was slow and unreliable. The calculations presented in Appendix C have since
been re-programmed by another student, Donald Gennery, whose program works
very reliably and quite fast. It is his program which has derived most of
this author's camera models.

For  the  purpose  of  limiting  the  search  space,  it  matters  not  whether 
the  camera  model  is  given  or  derived.  The  existence  of  a  camera  model  makes 
possible  another  search-reduction  technique. 

With a camera model, it is possible to constrain the search for the
matching area to a line in the second image. To do this, the target is
characterized by a point, usually its center of mass. This point is
projected through the first camera as a ray in 3-space. The 3-dimensional
point corresponding to the original point in the image plane must lie on this
ray. The ray is now back-projected into the second camera, becoming a line
segment on the second image plane (whose exact equation is derived in
Appendix C). Since the 3-dimensional point was on the ray, its projection
into the second image plane must lie on this line segment.
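The construction can be illustrated with a simple pinhole model. The sketch below is not the derivation of Appendix C: it assumes the first camera sits at the origin looking along +z, describes the second camera by a hypothetical rotation matrix, center, and focal length, and takes the near and far depth limits of the ray as given. All names are chosen for the example.

```python
def project(point, rotation, center, focal):
    """Pinhole projection of a 3-D point into a camera located at `center`."""
    d = [point[i] - center[i] for i in range(3)]
    cam = [sum(rotation[r][c] * d[c] for c in range(3)) for r in range(3)]
    return (focal * cam[0] / cam[2], focal * cam[1] / cam[2])

def back_projected_segment(image_point, focal1, rotation2, center2, focal2,
                           near, far):
    """End points, in the second image plane, of the ray through image_point.

    The first camera is assumed to be at the origin looking along +z;
    near and far are assumed depth limits along that axis.
    """
    x, y = image_point
    ends = []
    for depth in (near, far):
        world = (x * depth / focal1, y * depth / focal1, depth)
        ends.append(project(world, rotation2, center2, focal2))
    return ends

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
segment = back_projected_segment((0.0, 0.0), 1.0, identity, (1.0, 0.0, 0.0),
                                 1.0, near=1.0, far=10.0)
```

In this toy configuration the second camera is displaced along x only, so the segment is horizontal in the second image.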

With this knowledge, it is not necessary to search the entire picture
for a match; searching along the line segment will suffice. Illustration 3-4
shows for two different areas of the barn pictures a target area in the first
image, its center point, the line which this point projects to, and the
matching area found by searching along the line segment. This technique
reduces the search space in an NxN picture from the N^2 candidate areas
centered at the points of the picture to the N or fewer candidate areas
centered on the points of the line segment. Performing a one-dimensional
analog of the technique of gridding along the line can result in an
additional savings of a factor of m, the grid spacing.

Techniques involving camera models will work whenever a camera model
exists, but their efficiency in reducing the search depends on the accuracy
of the camera model. An exact camera model will give the line exactly. A
moderately inaccurate camera model will usually put the line in the right
area, although some local searching may be necessary. The better the model,
the smaller the local search.


WORLD MODELS


If, in addition to a camera model, there exists a model for the
world, then it is possible to predict precisely where the center point of the
matching area will be. The ray from the first camera will intersect the
world model at a 3-dimensional point which can be back-projected into the
second camera, giving the center point of the predicted match.

Even a fragmentary world model can reduce the search significantly.
For instance, knowledge of the position of the ground plane limits the depth
at which an object can lie [Falk, 1969]. Thus the matching center point is
constrained to lie on that part of the back-projected line segment between
the points which represent the camera and the ground plane.
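As an illustration of the ground plane constraint, the sketch below finds where a ray from the first camera meets a plane, which bounds how far along the ray the object can lie. The plane representation (a normal vector and an offset) and the function name are assumptions made for the example.

```python
def ray_plane_depth(origin, direction, normal, offset):
    """Parameter t at which origin + t*direction meets the plane normal.x = offset.

    Returns None when the ray is parallel to the plane or meets it behind
    the camera.
    """
    rate = sum(n * d for n, d in zip(normal, direction))
    if rate == 0:
        return None
    t = (offset - sum(n * o for n, o in zip(normal, origin))) / rate
    return t if t > 0 else None

# Camera 1.5 units above the ground plane y = 0, looking slightly downward.
limit = ray_plane_depth((0.0, 1.5, 0.0), (0.0, -0.5, 1.0), (0.0, 1.0, 0.0), 0.0)
# A ray pointing upward never reaches the ground.
no_limit = ray_plane_depth((0.0, 1.5, 0.0), (0.0, 0.5, 1.0), (0.0, 1.0, 0.0), 0.0)
```

The point at parameter `limit` can be back-projected into the second camera to give the far end of the segment to be searched.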

If the world model is not given, it is still possible to derive it
from the matched area pairs. However, derived world models are more often
the result of the matching process, not the means for its improvement.

One trivial sort of world model which does not require a camera model
(although it can be used with one) is the continuity assumption. It consists
merely of assuming that if areas A and B are adjacent in the first image,
then their matches will be adjacent in the second image. This, of course,
reduces the search space considerably, to the immediate neighborhood of the
last match found.

How effective the use of world models is depends on the accuracy of
the model. Small errors in the model may make little difference in the
predicted position of the match. Large errors, like assuming continuity near
a depth discontinuity, will cause no match to be found. In this case, a
retreat must be made to one of the more general techniques of matching.

Each of the techniques described in this chapter results in a fairly
large savings when matching areas in stereo images. Combining two or more of
them increases the savings. The author has had excellent success with
programs combining reduction, similarity, and gridding and with ones
combining reduction, camera models, and gridding. (See Chapter 6 for
descriptions of some of the programs.) The author has not implemented any of
the world models save the continuity assumption (see Chapter 5), but the
Hand-Eye group at Stanford A. I. uses the ground plane model to good
advantage, and Bruce Baumgart (unpublished research) has done some work
with exact world models.


[Illustration 3-2 graphs; panel title: LAB PICTURES, (45, 65) - (51, 71)]

Illustration 3-2. The top row contains graphs of C(X,Ix,Jx;Y,Iy+dI,Jy)
against dI and C(X,Ix,Jx;Y,Iy,Jy+dJ) against dJ, as before. The bottom row
contains graphs of C(X,Ix,Jx;X,Ix+dI,Jx) against dI and C(X,Ix,Jx;X,Ix,Jx+dJ)
against dJ, i.e., the autocorrelation cross sections for the same target area.
Like these, most match correlations closely resemble their autocorrelations.


Illustration 3-3. Tabulated results of correlation searches using the search
reduction technique of similarity with correlation gridding in the promising
areas. The first two columns give the area center in the first Lab picture;
the second two tell how many promising areas were found and how many
correlations were calculated for the black-and-white vectors described in the
text; the third two give the number of promising areas and correlations
calculated for the color vectors described. The color vectors are usually
worth the increased time needed to calculate them. Either set of vectors
results in a significant reduction in the amount of time needed to find the
match.


Illustration 3-4. Two pairs of barn pictures, showing camera model searches
for a match. In each pair, the first image shows the target areas with their
central points; in the second image the line to which the center point
projects is shown, along with the matching area and its center.


Chapter  4 


UNMATCHABLE  TARGET  AREAS 


Careful analysis of the techniques discussed so far will show that,
in addition to the assumptions stated in Chapter 1, we have been making one
other, unstated assumption. We have assumed that there exists some
window-based algorithm by which all target areas can be matched.

Unfortunately, there are entire classes of target areas which do not
fit this assumption, i.e., which require global techniques to determine which
area is their intuitive match. These fall into two major groups: those which
can be detected before matching is attempted and those which come to light
only when matching fails.


DETECTABLE  BAD  TARGETS 


The first group of unmatchable target areas are those containing data
which is by its very nature unmatchable. These unmatchable areas can be
detected before matching is attempted by examining the target data.


Low Information

When the target area contains little or no information, matching that
area is impossible by area-based measure-of-match techniques. For example,
consider a window taken out of the cloudless sky of the barn pictures. Based
on just the information in that window, it is impossible to say precisely
which piece of sky in the second image matches this area. In the absence of
noise, a low-information target area will match almost any low-information
candidate area, for there is nothing in the target to distinguish which
candidate it really matches.

An area of low information is an area of low variance. This is
perhaps the most computationally expedient way of determining whether or not
an area has sufficient information to be matchable. In the presence of
noise, this technique may fail, since noise creates variance. In this case,
some other test, such as the presence of an edge, should be used.
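A variance test of this kind takes only a few lines. In the sketch below the function name and the threshold value are assumptions; a suitable threshold would have to be chosen from the noise level of the pictures at hand.

```python
from statistics import pvariance

def has_enough_information(window, min_variance):
    """True when the intensity variance of the window reaches min_variance."""
    pixels = [p for row in window for p in row]
    return pvariance(pixels) >= min_variance

sky = [[40, 41], [40, 41]]     # nearly uniform intensities
edge = [[10, 60], [10, 60]]    # strong contrast across the window
sky_ok = has_enough_information(sky, min_variance=4.0)
edge_ok = has_enough_information(edge, min_variance=4.0)
```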

An area of low information will also have an autocorrelation peak
which, except for a value of 1.0 at zero displacement, will be almost flat.
(Illustration 4-1 shows the correlation and autocorrelation graphs for an
area of the sky in the barn pictures.) This flatness can be recognized by
inspection of the peak, or, if more precise determination is desired, a
bivariate normal distribution surface [Freund, 1962] can be fitted to the
peak and the parameters of the curve examined. Any area having a very flat
autocorrelation peak is unsuitable for matching.


Linear  Edge 

When the target area has a single linear edge across it with little
or no information on either side of this edge, matching is very difficult.
An attempt to match such a target will show that the area matches quite well
with candidate areas all along the edge in the second image.

This condition is observable in the autocorrelation peak. (See
Illustration 4-2.) If one fits a bivariate normal distribution surface to the
autocorrelation and examines the contours of this surface, one discovers that
the peak is really a ridge aligned with the edge.

If we use only the information in the target area, there is nothing
to resolve which candidate along the edge is the real match. Target areas
displaying this property must be regarded as unmatchable unless further
information, such as a camera model or a set of other matches to tie to, is
available.

Pre-processing of a target area to determine whether or not it is
suitable for matching is expensive. However, if one compares this expense
with the expense of searching futilely for a match, such pre-processing
becomes worthwhile.


TARGETS WHICH DO NOT MATCH


The second group of unmatchable target areas are those whose
counterparts simply do not exist in the second image, due to relative motion
between the camera and part or all of the scene. Such unmatchabilities
cannot be detected by examining either picture alone, but are discovered only
after the expense of attempting to match has been incurred. Since, in this
case, the target area has no proper match, the candidate area having the
highest correlation will be an incorrect match. It is desirable to be able
to detect these incorrect matchings as they occur.

If two areas do not match, the correlation between them should be
low. It seems reasonable, therefore, to detect bad matches by seeing if the
best correlation obtained was too low. Matters are complicated by the fact
that some good matches have low correlations. In fact, for almost any pair
of pictures and fixed threshold, it is possible to find either a target area
for which there is a bad match with a correlation above that threshold or a
target area whose proper match has a correlation below the threshold.

So, how does one distinguish between good matches with low
correlations and bad matches? As previously stated, the correlation peak for
a proper match should very closely resemble the autocorrelation peak for the
target area. In particular, if we have restricted our target areas to those
with distinct autocorrelation peaks, a flat or chaotic correlation peak is an
indication of a questionable match.

The  fact  that  the  correlation  and  autocorrelation  peaks  should  be 
similar  can  be  used  to  derive  an  autocorrelation  threshold  for  the  match 
correlation.  By  examining  the  autocorrelation  surface  at  points  near  the 
summit  of  the  autocorrelation  peak,  it  is  possible  to  predict  what  the 
correlation  should  be.  (See  Appendix  B.)  Any  match  below  this 
autocorrelation  threshold  is  highly  suspect. 

Of course, global information, such as continuity from neighboring
points, can also be used to determine the credibility of a match.


NON-UNIQUE MATCHINGS


A related problem is that of multiple matches. Since we have not
specifically limited the subject matter of our pictures, it is possible that
more than one of some object can appear in the pictures. If several of these
objects appear against similar backgrounds, a target area can quite
reasonably have not one but several matches.

If several areas match the target area, they can be expected to all
have about the same correlation. If they are good matches, all of them
should be greater than the autocorrelation threshold for the target area.
Therefore, to detect multiple matches, one checks to see if there is more
than one correlation above the autocorrelation threshold. If so, one checks
how well they group. If the top few correlations above the autocorrelation
threshold are roughly the same in value, multiple matches are indicated.
This can be confirmed by checking to see if the multiple candidates correlate
well with each other.
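The grouping check can be sketched as below. The function name and the `tolerance` parameter, which decides what counts as "roughly the same in value", are assumptions; the autocorrelation threshold is taken as given, and the confirming step of correlating the candidates with each other is left out.

```python
def multiple_match_candidates(correlations, threshold, tolerance):
    """Indices of candidates that tie for best above the autocorrelation threshold.

    Returns an empty list when fewer than two candidates group together.
    """
    above = [(c, i) for i, c in enumerate(correlations) if c > threshold]
    if len(above) < 2:
        return []
    best = max(above)[0]
    group = [i for c, i in above if best - c <= tolerance]
    return group if len(group) > 1 else []

tied = multiple_match_candidates([0.91, 0.35, 0.89, 0.62, 0.90],
                                 threshold=0.8, tolerance=0.05)
unique = multiple_match_candidates([0.91, 0.30, 0.62], threshold=0.8,
                                   tolerance=0.05)
```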

If only the information in the areas is present, an area with more
than one match indicated is no more useful than an area with no match
indicated, since in neither case has the location of the match been
determined. Additional, more global information in the form of a camera
model or other matches to tie to can be used to resolve the ambiguity.


WHAT TO DO WHEN A TARGET WON'T MATCH PROPERLY


For some of the target areas which won't match properly using measure
of match techniques, there is nothing that can be done. Target areas whose
matches fall out of the field of view of the second camera are clearly in
this class. Target areas of low information cannot be matched reliably,
therefore are assigned to this class. Target areas containing distortion due
to perspective change by definition do not have matches, therefore are also
assigned to this class. The only reasonable thing to do with targets of
these varieties is to give up on them.

Other  types  of  unmatched  target  areas  may  be  matchable  by  some 
different  algorithm,  probably  utilizing  more  global  information.  If,  for 
instance,  we  are  employing  the  similarity  heuristic,  and  it  fails  for  some 
reason,  it  may  be  that  pure  gridding  will  find  the  match.  Ambiguous  matches 
and  linear  edges  between  areas  of  low  information  (which  can  be  thought  of  as 
extended  ambiguities)  can  usually  be  resolved  by  algorithms  which  employ 
additional  information,  such  as  a  good  camera  model. 

Having a camera model enables one to find the line segment in the
second image which corresponds to the center point of the target area. If
one of the proposed candidate areas has a center point that lies within one
pixel of this line segment, then the match is resolved. This algorithm fails
if more than one proposed candidate lies within one pixel of the magic line
segment, i.e. if two or more of the nominated objects are approximately
coplanar with the two camera principal points. This is a fairly common
occurrence, since a man-made world containing identical objects is likely to
have these objects on a flat surface.
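
A minimal sketch of this test follows, assuming the camera model has already
produced the two end points of the line segment in the second image. The
function names and the tuple representation of points are hypothetical; the
derivation of the segment itself is not shown.

```python
import math


def distance_to_segment(p, a, b):
    """Distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:               # degenerate segment: a single point
        return math.hypot(px - ax, py - ay)
    # Parameter of the closest point on the line, clamped to the segment.
    t = ((px - ax) * dx + (py - ay) * dy) / length_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def resolve_with_camera_model(candidates, seg_start, seg_end, max_dist=1.0):
    """Return the one candidate center within max_dist pixels of the segment.

    Returns None if no candidate, or more than one candidate, lies that
    close; the second case is the coplanar failure described in the text.
    """
    near = [c for c in candidates
            if distance_to_segment(c, seg_start, seg_end) <= max_dist]
    return near[0] if len(near) == 1 else None
```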

The presence of a set of other matches can also be used to resolve
ambiguities. The target area will have some spatial relationship to the
target areas of the set; the match is the proposed candidate which most
closely approximates this relationship with the candidate areas of the match
set [Fischler and Elschlager, 1971]. Of course, care must be exercised in
the choice of the set of points. If, for instance, one's anchor points are
all in the foreground in the barn pictures, and one is trying to match the
fence posts across the field, one will get a meaningless answer. The anchor
point pairs used should be at the same depth as the target area to guarantee
correct results.
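
As a rough illustration only, the sketch below stands in for the "spatial
relationship" with the simplest possible one: the mean displacement of the
anchor pairs. This is an assumption made for the example and is not the
method of Fischler and Elschlager; the function name and data layout are
also hypothetical.

```python
def resolve_with_anchors(target, candidates, anchor_pairs):
    """Pick the candidate that best agrees with a set of known matches.

    target:       (I, J) center of the ambiguous target area.
    candidates:   (I, J) centers of the proposed candidate areas.
    anchor_pairs: list of ((It, Jt), (Ic, Jc)) matches already made,
                  which should lie at about the same depth as the target.
    """
    n = len(anchor_pairs)
    mean_di = sum(c[0] - t[0] for t, c in anchor_pairs) / n
    mean_dj = sum(c[1] - t[1] for t, c in anchor_pairs) / n
    # Where the target would land if it moved as the anchors did.
    predicted = (target[0] + mean_di, target[1] + mean_dj)
    return min(candidates,
               key=lambda c: (c[0] - predicted[0]) ** 2
                             + (c[1] - predicted[1]) ** 2)
```

If the anchors lie at a different depth from the target, the predicted
position is wrong and so is the choice, which is the fence-post failure
described above.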

In the case of depth discontinuities, one could employ edge
techniques [Hueckel, 1969] to segment the target area into regions. These
irregular areas could then have matching attempted on each of them
separately, using masked correlation or pointer correlation. (See Appendix
B.)


Various methods exist for handling individual unmatchable target
areas. In each case, it is first necessary to determine which variety of
unmatchability one has, then apply the proper method. Quite often, this is
done by the experimenter peeking; that is, the experimenter figures out what
kind of unmatchability he has and tells the "algorithm" what to do.

This author has found, however, that the best thing to do with an
unmatchable target area is to give up on it and try a different target area.
Eventually, target areas that have good matches will come along. (If not,
the experimenter SHOULD peek to see if he has the right two pictures!) With
good matches, the technique of region growing becomes applicable. Most of
the problems related to unmatchable areas can be solved or greatly simplified
by the use of region growing.


[Graphs: BARN PICTURES, (85, 25)]


Illustration 4-1. Correlation and autocorrelation graphs for an area of low
information, showing the two-dimensional flatness of such peaks.


Chapter  5 


EXTENDING MATCHES


In  Chapter  3,  we  mentioned  the  continuity  assumption  as  a  crude  form 
of  world  model  which  greatly  shortened  most  searches  for  a  match  when  there 
was  an  adjacent  match  available.  This  continuity  assumption  forms  the  basis 
for  the  technique  of  extending  matches. 


REGION GROWING: THE BASIC TECHNIQUE


Under the continuity assumption, if the target area centered at
(Ix,Jx) matches the candidate area centered at (Iy,Jy), then one would expect
the four adjacent target areas (Ix+1,Jx), (Ix-1,Jx), (Ix,Jx+1), and (Ix,Jx-1)
to match the four adjacent candidate areas (Iy+1,Jy), (Iy-1,Jy), (Iy,Jy+1),
and (Iy,Jy-1), respectively. If (Ix,Jx) matches (Iy,Jy), then the
correlation between these two areas represents the peak of the correlation
surface and is greater than the autocorrelation threshold for (Ix,Jx)
mentioned in Chapter 4 and described more thoroughly in Appendix B. If the
four adjacent expected matches are indeed matches, then each of them should
meet this same criterion. Once one of the expected matches meets the
criterion, then the paired areas adjacent to it become expected matches,
etc., and a region of constant (dI,dJ) = (Iy,Jy) - (Ix,Jx) is grown.

Expressed more formally: given a criterion for judging whether or not
a point belongs within a region and at least one point at which that
criterion is met, the following algorithm extends the region.

1.  Push onto the stack at least one point which meets the criterion.

2.  Pop one point off of the stack and examine the points lying above, below,
    right, and left of it. Examining a point consists of first checking
    to see if it is marked as having been processed: if so, nothing
    further is done to it. Otherwise, if it meets the criterion of the
    region being grown, then it is marked GOOD and pushed onto the stack,
    else it is marked BAD and not pushed.

3.  Continue step 2 until the stack is empty.
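
The three steps above can be sketched as follows. The criterion and the
image bounds are passed in as functions; those two parameters, the
dictionary of marks, and the names used are assumptions made for the
example rather than details of the original programs.

```python
GOOD, BAD = 1, 2


def grow_region(seeds, meets_criterion, in_bounds):
    """Grow a region outward from seed points known to meet the criterion.

    seeds:           iterable of (I, J) points that meet the criterion.
    meets_criterion: function taking a point and returning True or False.
    in_bounds:       function telling whether a point lies in the picture.
    Returns a dict mapping every processed point to GOOD or BAD.
    """
    marks = {}
    stack = []
    for p in seeds:                    # step 1: push the starting points
        marks[p] = GOOD
        stack.append(p)
    while stack:                       # step 3: repeat until stack is empty
        i, j = stack.pop()             # step 2: pop and examine neighbors
        for q in ((i - 1, j), (i + 1, j), (i, j + 1), (i, j - 1)):
            if q in marks or not in_bounds(q):
                continue               # already processed, or off the image
            if meets_criterion(q):
                marks[q] = GOOD
                stack.append(q)
            else:
                marks[q] = BAD         # recorded, but not pushed
    return marks
```

Points never reached from a GOOD point are left unmarked, so a later growth
started elsewhere is still free to examine them.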

Marking the points not only leaves behind a record of which are good
and which are bad matches, but also keeps the algorithm from repeating work
which has already been done. Since there are only a finite number of points
available to try, this avoidance of repeated work guarantees that the
algorithm will terminate.


EXPEDITING REGION GROWING


As  its  criterion  for  a  match  having  occurred,  the  preceding  algorithm 
uses  the  fact  that  we  are  at  a  correlation  peak  and  that  the  maximum 
correlation  is  greater  than  the  autocorrelation  threshold.  For  each  match 
pair,  this  requires  ten  correlations— nine  to  determine  if  the  expected  match 
is  indeed  a  correlation  peak  and  one  to  calculate  the  autocorrelation 
threshold. 


In practice, eight of the nine correlations are not usually needed.
The autocorrelation threshold is derived from expected values of the
correlation surface at one pixel displacement from the match. In most cases,
the actual correlation at one pixel displacement is lower than the expected
correlation at that displacement, so that the only part of the correlation
surface which lies above the autocorrelation threshold is the match peak
itself. Testing to see that the correlation is greater than the
autocorrelation threshold is usually a sufficient criterion for determining
whether or not the expected match is indeed a match.

The correlation between the proposed matching areas and the
autocorrelation threshold for the target area still need to be calculated.
These two measures each require covering the target area while forming sums.
If the sums for both measures are calculated together in one pass over the
data, the target area need only be covered once, rather than twice. Thus the
combination of the correlation and autocorrelation will take about
three-quarters of the time necessary for calculating both separately, or
approximately 1.5 times as long as an ordinary correlation.
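
The idea of forming both sets of sums in one pass can be sketched as below.
Several things here are assumptions made for illustration: the standard
normalized cross-correlation formula is used, the autocorrelation is taken
against the target shifted one pixel to the right only, and the actual
threshold derivation of Appendix B is not reproduced.

```python
import math


def correlation_and_autocorrelation(img1, img2, ix, jx, iy, jy, w):
    """One pass over a (2w+1) x (2w+1) target area, forming both sums.

    Returns (correlation with the candidate area centered at (iy, jy),
             autocorrelation of the target at a one-pixel shift).
    """
    n = (2 * w + 1) ** 2
    s_t = s_c = s_s = 0.0        # sums: target, candidate, shifted target
    s_tt = s_cc = s_ss = 0.0     # sums of squares
    s_tc = s_ts = 0.0            # sums of cross products
    for di in range(-w, w + 1):
        for dj in range(-w, w + 1):
            t = img1[ix + di][jx + dj]        # target pixel, fetched once
            c = img2[iy + di][jy + dj]        # candidate pixel
            s = img1[ix + di][jx + dj + 1]    # shifted target (assumed shift)
            s_t += t
            s_c += c
            s_s += s
            s_tt += t * t
            s_cc += c * c
            s_ss += s * s
            s_tc += t * c
            s_ts += t * s

    def normalized(sa, sb, saa, sbb, sab):
        cov = sab - sa * sb / n
        denom = math.sqrt((saa - sa * sa / n) * (sbb - sb * sb / n))
        return cov / denom if denom > 0.0 else 0.0

    return (normalized(s_t, s_c, s_tt, s_cc, s_tc),
            normalized(s_t, s_s, s_tt, s_ss, s_ts))
```

The target sums s_t and s_tt are shared by both measures, which is where
the saving over two separate passes comes from.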

This is effectively the optimum technique for determining a match.
It requires only 1.5 correlations, as opposed to N^2 correlations for the
direct method, a savings of a factor of N^2.

EXTENDING  MATCHING  REGIONS 

In  our  revised  algorithm,  an  area  center  would  be  marked  BAD  if  its 
correlation  were  not  greater  than  its  autocorrelation.  For  such  points,  the 
pair  of  areas  may  or  may  not  represent  a  correlation  peak. 

If the pair of areas does not represent a correlation peak, the
continuity assumption need not have been violated. It could well be that
this particular part of the scene is continuous, but that the normal to the
surface is at a moderate angle to the camera principal axes. This can cause
a gradual change in (dI,dJ) as one moves across the picture. If this is the
case, then a short local search should reveal the correlation peak which
represents the match. For this purpose, using one "loop" of the spiraling
search subroutine MATCH, described in Appendix B, works quite well.

Once the peak is found, it may or may not pass the autocorrelation
threshold. If it does, then this new pair of (Ix,Jx) and (Iy,Jy) becomes a
candidate for the application of the region growing algorithm, and the region
continues to expand. Illustration 5-1 shows one of the results of this
extended region grower.
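
A minimal sketch of such a short local search follows. It does not
reproduce MATCH; a single ring of the eight neighboring displacements is
assumed to stand in for one "loop" of the spiral, and the correlate
function supplied by the caller is hypothetical.

```python
def local_peak_search(correlate, di, dj, threshold):
    """Look one pixel around the expected displacement for a better match.

    correlate: function (di, dj) -> correlation between the target area
               and the candidate area at that displacement.
    threshold: autocorrelation threshold of the target area.
    Returns ((best_di, best_dj), passed), where passed tells whether the
    best correlation found exceeds the threshold.
    """
    best_disp = (di, dj)
    best_score = correlate(di, dj)
    for ddi in (-1, 0, 1):
        for ddj in (-1, 0, 1):
            if ddi == 0 and ddj == 0:
                continue               # center already evaluated
            score = correlate(di + ddi, dj + ddj)
            if score > best_score:
                best_score = score
                best_disp = (di + ddi, dj + ddj)
    return best_disp, best_score > threshold
```

A displacement that passes would be handed back to the region grower as
the starting pair for a new sub-region.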

Any pair of areas that represents a correlation peak but does not
pass the autocorrelation test remains unmatched for the present, since in
theory that target area has a match elsewhere, which a later region growing
will locate.


HOW REGION GROWING SOLVES THE PROBLEMS

In Chapter 4, we promised that region growing would solve, or at
least simplify, most of the problems encountered in matching. We divided the
unmatchable areas into two categories--those, such as ambiguities and depth
discontinuities, which could be matched or partially matched by special means
and those which simply had no match, whether due to obscurations,
distortions, or changes in the field of view. The problem was that, except
for ambiguities, we had no way of telling which variety of unmatchability a
given target area might be. If a given target wouldn't match, "peeking" was
the only way of telling whether the area was a depth discontinuity which
should be segmented or an obscuration which should have no further time
wasted on it. Region growing from a few good matches spread about the
picture helps here.

Suppose, for instance, a target area which previously failed to match
now falls within a region of grown matches. If the target failed to match
because of an ambiguity, whether one caused by multiple objects or a linear
edge, this ambiguity has been resolved. If the target area didn't match
because of a failure of the heuristics, the difficulty has now been
overcome.

Suppose the unmatched target lies just outside of a grown region. If
target areas leading up to the unmatched target should match candidate areas
leading to the edge of the image, then the intuitive match for our unmatched
target area falls out of the field of view of the second camera. In a
similar fashion, an unmatched target whose intuitive match has been obscured
can now be detected; target areas leading up to the unmatched target will
match candidates that lead into a region of candidates having a different
matching (dI,dJ)--that of the obscuring object.

If the unmatched target lies in the midst of a "hole" in a grown
region, then a moving object which has disappeared, such as the man on the
steps in the lab pictures, is indicated. If the unmatched target lies near
the edges of two grown regions with rather different matching (dI,dJ), then
chances are that the unmatched target contains the depth discontinuity
between these two regions.

For most parts of most allowable pairs of stereo images, the
continuity assumption holds, so region growing can usually match almost all
of the areas of most pairs given just a few "starter" matches. For example,
all of the matchable area of the lab pictures can be grown from one match in
the background; in the canyon pictures, three matches are required--one on
the background canyon wall, one on the foreground promontory, and one on the
pinnacle at the right.

Because of the area-based nature of matching, region growing stops
when the area reaches a depth discontinuity or touches a distorted region.
In the finished products, such as Illustration 5-1, what is displayed is the
outer line of center points which the region grower found not to match.
Consequently, these products do not precisely outline depth discontinuities
or areas of distortion, but fall W pixels away from these edges, where W is
the area radius. However, if one is willing to iterate around the edges
using smaller and smaller values of W, then closer and closer approximations
of these outlines can be found [Levine, 1973].

Thus we see that region growing not only makes it easy to distinguish
what type of unmatchability one has, but also does what matching or partial
matching is needed. This is why we claimed that region growing would solve
or simplify all of the problems attendant to unmatchabilities.


GROWING UNIFORM REGIONS


Indeed, match extension region growing helps with all of the
unmatchable areas save those due to low information. As we noted in Chapter
4, areas of low information tend to be areas of low variance. Once such an
area has been located in the first image, the technique of region growing can
be used to mark that region so that future attempts at matches can be
forewarned of the condition.

For this application, the region growing algorithm presented in this
chapter need only be modified slightly. As its criterion for a good point,
the uniform region grower will use the fact that the variance over the area
centered at that point is below a given threshold. Thus instead of comparing
areas out of two images and continuing growth if they match, we are
evaluating an area in a single picture and growing if that area is of low
variance.
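
The variance criterion might be sketched as follows, for an image stored as
a list of rows and a square area of radius w. The function names and the
threshold value used in any call are assumptions for the example.

```python
def window_variance(img, i, j, w):
    """Variance of the intensities in the area of radius w centered at (i, j)."""
    values = [img[a][b]
              for a in range(i - w, i + w + 1)
              for b in range(j - w, j + w + 1)]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def is_low_variance(img, i, j, w, threshold):
    """Criterion for the uniform region grower: area variance below threshold."""
    return window_variance(img, i, j, w) < threshold
```

A function of this kind would take the place of the match criterion in the
basic region growing algorithm. Note that a point whose area overlaps a
sharp edge fails the test even if the point itself lies in the uniform
part of the picture.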

As Illustration 5-2a shows, uniform regions grown by this method will
stop a bit short of their edges, since any point whose area touches the edge
will have a higher variance, and thus be rejected. Whether this is bad or
good depends on whether the user wanted to delimit the entire uniform region
or only that part of it which had too little information to match upon.

If the desired effect was that of Illustration 5-2b, then a somewhat
different criterion needs to be employed. Low variance means that the
average squared difference between the intensity at a pixel and the mean
intensity over the area is small. For an area to have a small variance, most
of these differences at individual pixels must be small. Hence, we
substitute into the uniform region growing algorithm the criterion that the
absolute difference between the intensity at a point and the mean intensity
over the uniform region be small.
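
A sketch of a grower using this second criterion is given below. It assumes
the mean is taken over all of the region grown so far and that "small"
means within a constant max_diff; both are choices made for the example,
and the function name is hypothetical.

```python
def grow_uniform_region(img, seed, max_diff):
    """Grow a uniform region by the deviation-from-the-mean criterion.

    img:      image as a list of rows of intensities.
    seed:     (I, J) point assumed to lie inside the uniform region.
    max_diff: largest allowed absolute difference between a pixel and
              the mean intensity of the region grown so far.
    Returns the set of points accepted into the region.
    """
    rows, cols = len(img), len(img[0])
    total = float(img[seed[0]][seed[1]])   # running sum of accepted pixels
    count = 1
    good = {seed}
    bad = set()
    stack = [seed]
    while stack:
        i, j = stack.pop()
        for q in ((i - 1, j), (i + 1, j), (i, j + 1), (i, j - 1)):
            a, b = q
            if not (0 <= a < rows and 0 <= b < cols):
                continue                   # off the image
            if q in good or q in bad:
                continue                   # already processed
            if abs(img[a][b] - total / count) <= max_diff:
                good.add(q)
                stack.append(q)
                total += img[a][b]
                count += 1
            else:
                bad.add(q)
    return good
```

Because each point is judged on its own intensity rather than on an area
around it, the region grown this way runs right up to the edge.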

Whether the mean intensity is taken over all of the region grown so
far or only over a local part of the region depends on whether the user
wishes the uniform region grower to stick strictly to a particular intensity
or to allow it to follow shading and other gradual changes in intensity or
color, such as occur in a clear summer sky. How small the absolute
difference in intensities must be at each point is based on how much
variation is expected (or desired) within the area to be grown, and can
either be a constant or a statistical measure, such as a multiple of the
standard deviation of the intensities within the area. Which uniform region
grower one uses, of course, depends upon the effect which the user wishes to
produce.


Illustration 5-1. Two pairs of pictures with overlays to show regions
delimited by the extended region grower. In the barn pair, the foreground
post has been outlined; in the canyon pair the nearest spine of the
foreground promontory is shown. Each of these regions consists of several
sub-regions at slightly different displacements.


Illustration 5-2. Uniform regions delimited by the region grower. Part (a)
shows regions grown by the variance-over-a-window method; part (b) shows the
same regions grown by the deviation-from-the-mean method.


Chapter 6


ALGORITHMS  AND  EXAMPLES 


So  far,  we  have  presented  a  variety  of  techniques,  mentioning  only 
briefly  how  they  might  be  used.  In  this  chapter,  we  discuss  algorithms  which 
use  these  techniques  and  give  examples  of  their  results. 


INDIVIDUAL  MATCHES 


Sets of individual matches can be used for a variety of things. They
can be used to align data for further processing such as differencing [Quam,
1971]. They can be used to derive camera models (see Appendix C). With a
camera model, a pair of matching points can be used to determine the relative
depth to an object in a scene (see Appendix C). Matches and a camera model
make it possible to create a 3-dimensional world model [Baumgart, unpublished
research, 1973].

For most applications, there is no need to match particular areas.
What is needed is a set of matches that are well distributed in both images.
Since very precise matches are usually needed for modelling work, it will be
necessary to interpolate discrete matches in order to determine the exact
translation. (See Appendix B for a discussion of the need for and techniques
of interpolation.) Whenever possible, one should choose the target areas so
that matching will be easy and interpolation will be accurate.


Choosing  a  Target  Area 

Interpolation  is  most  accurate  if  the  match  peak  is  well  behaved--not 
too  flat,  not  too  sharply  peaked,  and  definitely  not  multi-modal.  Since  the 
correlation  peak  should  closely  resemble  the  autocorrelation  peak,  target 
areas  should  be  limited  to  those  with  well  behaved  autocorrelation  peaks. 
The  target  areas  whose  autocorrelation  peaks  can  be  easily  fitted  by  a 
bivariate  normal  distribution  surface  are  most  likely  to  yield  accurate 
interpolated  match  displacements. 

Requiring well behaved autocorrelation peaks will also exclude
targets which will be hard to match. Flat autocorrelation peaks due to low
information, sharp peaks due to only high frequency information being
present, and multi-modal peaks due to ambiguities will all be avoided.

To make matching easy, target areas should first of all contain
sufficient information. Therefore, only areas having a variance above a
threshold should be considered. A reasonable strategy is to first match
those target areas that have the highest variance. Of course, high variance
can indicate the presence of sharp edges, so each such target area should be
checked to see that it is not crossed by a strong linear edge between two low
variance areas.

If similarity is to be employed in matching, a quick perusal of the
vectors for the representative areas in the second image can be informative.
For instance, if the second image contains lots of green areas, but only a
few red ones, then one can get some matches cheaply by first matching target
areas with red in them.


Program Outline

A program which is to produce a set of well distributed good matches
might proceed as follows.

INITIALIZATION. First of all, reduce both images and divide them
into representative areas the size of the correlation windows to be used.
(Unless otherwise stated, all of the steps that follow are to be carried out
in the reduced pictures.) The areas in the first image may simply cover the
picture; those in the second image should be on a finer grid so that they
overlap significantly. (See Illustration 6-1.) Then calculate the vectors of
statistics for these representative areas. Histogram each of the components
of the vectors for each picture.

RANK TARGET AREAS. Now, do any of the component histograms show only
a few target areas having some property (like being red)? Do at least that
number of candidates show that property? (If not, some of the target areas
will be out of the field of view in the second image, hence unmatchable.)
Put any areas which seem likely to be easy to match at the head of a list of
target areas to be tried.

Next, sort the remaining target areas by their variance. Place those
with variances above the low information threshold on the list. Also sort
the candidate areas by variance and remove any with too low variance. Since
we have removed the low variance target areas, it is unlikely that any of the
low variance candidate areas will be needed. Start matching targets off the
top of the list.
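
The sorting step can be sketched as below. The dictionary layout and the
function name are assumptions for illustration; the text says only that
areas are sorted by variance, and the highest-first order follows the
strategy suggested earlier of matching the highest variance targets first.

```python
def rank_by_variance(variances, low_info_threshold):
    """Order areas by decreasing variance, dropping low information areas.

    variances:          dict mapping an area center (I, J) to the variance
                        of the intensities over that area.
    low_info_threshold: areas at or below this variance are discarded.
    """
    keep = [(v, area) for area, v in variances.items()
            if v > low_info_threshold]
    keep.sort(key=lambda pair: pair[0], reverse=True)
    return [area for v, area in keep]
```

The same function could be applied to the candidate areas of the second
image to remove those with too low a variance.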

TARGET MATCHING. For each target area, check to see if its
autocorrelation surface is well behaved. If so, establish the
autocorrelation threshold and grid spacing parameters for that target area
and continue. If not, discard the area and try the next one.

Calculate the similarity measure between the target area and each of
the remaining candidate areas and sort them. Start on the most likely area.
Using the grid spacing established for that target, grid the candidate areas
and look for a correlation above the noise threshold. Then search the
immediate neighborhood for the best correlation (or simply employ MATCH,
described in Appendix B).

If ambiguous matches are not anticipated to be a problem, stop
examining candidate areas as soon as a candidate is found that has a
correlation above the target area's autocorrelation threshold. Otherwise,
continue examining areas until the candidates become too dissimilar
according to the measure of similarity. If no candidate had a correlation
that was high enough, forget that target area.

Now go back into the original, full resolution pictures.
Re-determine the autocorrelation threshold for the full resolution target
area. Re-optimize the correlation for each of the promising candidate areas.
Test these correlations for bad matches and ambiguity. Discard the target
area if it fails these tests; otherwise interpolate the match in the full
resolution pictures and record it. Go on to the next target area.

Continue matching target areas in this fashion until enough matches
with the proper spatial distribution are obtained or the list of matchable
areas is exhausted. Take the results and do your thing with them.

The algorithm described here has not been implemented in totality;
however, most of its pieces have been implemented. Reduction of images is
accomplished by a program called PICSEE by Lynn Quam. The initialization and
sorting of target areas is done by the author's program VECTDO, which
calculates the color vectors described in Chapter 3, or VECTBO, which does
the black-and-white vectors. The target matching is done by the author's
program NEUPTS. Final decisions in the full-scale images are made by the
author's program REFINE. All of these programs are written in the SAIL
dialect of ALGOL [VanLehn, 1973] at the Stanford Artificial Intelligence
Project. Critical inner loops are written in START_CODE, an embedding of
PDP-10 assembly language into SAIL.

Illustration 6-2 shows a set of matches produced by this system of
programs and run through the author's program DEPTH to figure the depths at
each point pair in meters.


A  COMPLETE  MATCHING 


The ultimate combination of matching techniques occurs in an
algorithm for creating a complete matching. Such an algorithm puts together
all of the techniques we have developed and shows how they interrelate.

We begin with the algorithm described in the first section of this
chapter. This gives us a set of precise interpolated matching areas. We
feed the point pairs to a camera model derivation routine which returns a
camera model.

Next we seek low variance regions and employ one of the uniform
region growers described in Chapter 5 to color these regions unmatchable.
All region growing is done in an auxiliary "picture" which we will use to
keep track of the parts of the first image that we have processed and to
record the matches which have been made.

The matches which determined the camera model are then
un-interpolated--that is, they revert to the discrete form they had before
interpolation--and put onto a stack of regions to be extended. A match pair
is popped off of this stack and passed to the region grower for extending
matches. As the region grower proceeds, it marks in the areas it grows in
the recording picture and in a second auxiliary picture which keeps track of
which area centers in the second image have been matched.

When the region grower finishes each sub-region having the same
displacement (dI,dJ), a cleanup algorithm goes around to all of the points
marked BAD on that round. So that future growings can have a chance to work
on them, they are re-marked as being unmatched and placed on the stack of
pairs of points waiting to have the region grower applied.

Each pair of points taken off of this stack is re-MATCHed (see
Appendix B) to find the correlation peak, which is compared to the
autocorrelation threshold. Point pairs which pass this criterion, and
haven't been overgrown by some previous extension, are passed to the region
grower, until the stack of point pairs awaiting the region grower becomes
empty.


When the original set of matches has been exhausted, we begin looking
in the recording picture for areas which have not been marked. For a
representative point in the midst of such a region, we attempt matching using
the camera model as described in Chapter 3. In this case, we can further
limit our search along the back-projected line in the second image with the
knowledge that some of the points in the second image have already been
matched, hence do not need to be considered. For each match found this way,
the region grower is started up again. This continues until all of the
unmatched areas have either been examined or are smaller than some critical
size, below which we do not bother with them.

As yet, this algorithm has not been implemented as a whole. However,
most of the parts do exist as separate programs which communicate with each
other via disk files of data. In addition to the programs described in the
last section for finding a set of well-distributed matches, we use CAMERA,
written by Donald Gennery at Stanford A. I., to determine the camera model
corresponding to our set of point pairs. The finding and marking of low
variance areas is done by the author's program LOUINF. The actual extension
of regions from a file of matching area centers is done by the author's
program MGROW. The camera model search for matching point pairs is
implemented by the author's program CAMSCH. As with the programs from the
last section, these were written in SAIL on the PDP-10 at the Stanford
Artificial Intelligence Laboratory.

Illustration 6-3 shows the results of the author's program EMAKE on a
complete mapping generated by this system of programs.


Illustration 6-1. Yard pictures, with overlaid grids for target and
candidate areas. Notice that the candidate areas are on a much finer grid
than the target areas. A typical target and the representative candidate
which most closely resembles it are indicated by squares.



Illustration 6-2. Barn pictures, showing a set of matches produced by
NEUPTS. The dots indicate the center points of the matching areas; the
numbers by the dots give the distances in meters from the first camera to the
3-dimensional points which correspond to the point pairs.


Illustration 6-3. The canyon pictures, showing a complete matching. The
outlines show major depth discontinuities and delimit areas which could not
be matched.


Chapter 7


CONCLUSION 


It was the purpose of this thesis to investigate techniques by which
areas of one picture could be matched with the corresponding areas from the
second image of a stereo pair. We started with the assumption that we had
two images of the same scene which differed somewhat, but the majority of
which could be matched (as opposed to mapped, which is a different process).
That is, we treated those parts of the scene for which no gross distortions
had been introduced between the two views. Our objective of making matches
efficiently (i.e. without calculating the correlation between the target area
and candidate areas centered at every point in the second picture) was to be
reached by presenting techniques by which this could be accomplished.


ACHIEVEMENTS 


In this thesis, we have presented tools and techniques by which areas
in one picture can be efficiently matched with the corresponding areas in the
second picture.

We have discussed three measures of match which are suitable for this
purpose: normalized cross-correlation, root-mean-square error, and absolute
difference. In addition to the ordinary one-dimensional versions of these
measures, we have documented correlation for use in two dimensions, derived
color or vector correlation, masked correlation, and weighted correlation,
and explained function correlation, which can be used for mapping. We have
discussed some properties and relative efficiencies of the basic measures.
We have mentioned the existing techniques of fast Fourier convolution and
sampling for making the calculation of these basic measures more efficient,
but pointed out their shortcomings. It is our position that our techniques
have none of these shortcomings and are more efficient than these other
methods.


We have discussed several methods for pruning the search for a match.
Gridding and reduction each give a savings factor of n², where n depends on
the data in the images, but is typically 3 (savings factor is 9) for gridding
and 5 to 10 (savings factor is 25 to 100) for reduction. Similarity gives a
savings factor of 100 to 150 for the author's data. Camera models give a
savings factor of N, the width of the picture, typically about 200. World
model assumptions can result in a savings factor of almost N², the area of
the picture, typically 10,000 to 40,000.
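
As a quick check on the arithmetic behind these figures, the following sketch
recomputes the quoted savings factors. The function names are hypothetical,
and the picture widths of 100 to 200 pixels are an assumption inferred from
the quoted areas of 10,000 to 40,000; none of this is taken from the author's
programs.

```python
def grid_or_reduction_savings(n):
    """Savings factor n**2 quoted for gridding or reduction by a linear factor n."""
    return n * n

def camera_model_savings(width):
    """Savings factor of N, the picture width, quoted for camera models."""
    return width

def world_model_savings(width):
    """Upper bound N**2 (the picture area) quoted for world model assumptions."""
    return width * width

print("gridding, n = 3:", grid_or_reduction_savings(3))
print("reduction, n = 5 and 10:",
      grid_or_reduction_savings(5), grid_or_reduction_savings(10))
print("camera model, N = 200:", camera_model_savings(200))
print("world model, N = 100 and 200:",
      world_model_savings(100), world_model_savings(200))
```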

For those who do not have camera models given, we have included the
mathematics necessary to convert a set of matchings into a workable camera
model. We have also included calculations which use this model to find the
depth of the 3-dimensional point corresponding to a given pair of image
points.

We have discussed the fact that, with real data, not all target areas
are matchable. We have given methods by which some of the major types of
these unmatchabilities can be detected in the original data. Since some
unmatchable targets cannot be detected directly, we have developed methods
for detecting when a proposed match is not really a match.

We have discussed region growing techniques which can be used to
extend matching areas. Because these are based on the continuity assumption,
a sort of low level world model assumption, they are quite efficient methods
of finding matches. We have also presented region growing techniques which
can be employed to delimit uniform regions in one image.

Finally, we have presented two algorithms demonstrating how the
abstract techniques we have developed and documented can be combined to
perform useful functions in the processing of stereo images.


APPLICATIONS 


Some of the techniques of this thesis have already been adapted for
use in various artificial intelligence and robotics tasks. In addition to
the author's programs mentioned in Chapter 6, the reduction, gridding, and
similarity techniques and the uniform region growing have been incorporated
into programs for servoing a computer-driven cart [Quam, undocumented
research, 1973]. Gridding and the continuity assumption form the basis for
programs in a feasibility study for automating photogrammetric studies of the
planet Mars during the 1975 Viking mission [Quam, unpublished research
proposal, 1973]. The complete matching techniques described in Chapter 6
will undoubtedly play a part in this application, also.

Applications of multiple image processing also occur in medical
research. The registration of time-lapse x-rays for further processing is
only one of many possibilities.

Another eventual application is planetary exploration. For
inhospitable environments and extreme distances, on-board computer processing
of images will be vital to mission success.


AREAS  FOR  FURTHER  INVESTIGATION 


In  the  process  of  our  investigations,  we  have  discovered  a  number  of 
areas  which  need  more  work,  as  well  as  several  interesting  extensions  of  our 
work. 


The field of area mapping is for the most part untouched. We have
scratched the surface with this thesis on matching and our brief comments in
Appendix B. Much more can and should be done in this field. Complete,
separate investigations of techniques for motion and near-field stereo are
needed.


We have excluded noise from our data. There needs to be extensive
work on the effects of noise on matching. Also in need of exploration are
the techniques for alignment of regions by boundary matching, touched upon in
Appendix  □. 

We leave these as challenges to future investigators.

Appendix A


THE IMAGES


The  techniques  and  algorithms  described  in  this  thesis  have  been 
developed  and  tested  using  principally  four  pairs  of  pictures,  which  are 
described  and  presented  in  this  section.  Other  pairs  of  pictures  have  had 
isolated  techniques  used  on  them,  but  not  sufficiently  to  warrant  their  being 
presented  here. 

The images used were mainly of outdoor scenes. Some contained
man-made objects while others did not. The main criterion for selecting
these particular pictures to work with was that they were available and that
they had a certain esthetic appeal to the author.

Due to the limited facilities available for printing this thesis, it
is not feasible to reproduce the images in color. Consequently, the
illustrations presented here are black and white versions of the images used.


THE  BARN  PICTURES 


The first and most used pair of pictures is of a barn and field near
the author's home. The barn, a rather rustic building of unpainted wood with
a tin roof, appears at the left of the picture. In the foreground is a stock
fence of woven wire topped by 3 strands of barbed wire hung on hand-split
fence posts. Due to the relative camera motion between the two images of the
stereo pair, one of these fenceposts appears at the right of the first image
and at the left of the second.

The area in front of the barn is covered with green grass, on which
rest several abandoned objects, including two barrels, a lawn chair, and a
bench lying on its side. The shadow of a tree behind the camera and to the
right falls diagonally across this grassy area.

The grassy area extends into the distance. It is crossed by several
fences, one of boards near the barn and the rest of the same materials as the
foreground fence. The land rises somewhat; the skyline is a ridge about 120
meters from the camera positions. Two groves of oak trees cover most of this
ridge. A telephone pole stands in the small open area on the skyline between
the two groves.

The original photographs were 35 mm color slides. The cameras were
hand held in the field; the distance between the two camera positions is
slightly over one meter. The slides were photographed under standard red,
green, and blue filters to produce black and white negatives, which were then
digitized commercially. The resulting 800 by 1200 pixels of data were
windowed to remove a light leak in the lower portion of the foreground fence
and spatially reduced by a factor of five to produce 150 by 150 images.
Illustration A-1 shows the intensity pictures, made by averaging the red,
green, and blue component pictures.

The colors in the picture are mostly blues, greens, and browns. The
sky is a clear, saturated blue; the tin on the roof has a blue tinge. The
trees and the foreground grass are green. The barn and fence posts are a
rusty brown, while the grass in the distance is yellowish brown.

The barn pictures have been used as both color and intensity images.
They are the most referred to images in this thesis, partly because they were
the first images tried, partly because they present so many different
problems and exercises for matching, and partly because they are the author's
favorites among the images used.

Actually, the barn pictures violate the hypothesis that the change in
point of view does not significantly change the perspective of the scene.
The barn door is half again as wide in the second image as it is in the
first, a significant change. These changes, along with the "moved"
foreground post, are what make this pair of pictures difficult, hence
valuable.

THE  LAB  PICTURES 

The second pair of pictures is of Stanford University's D. C. Power
Laboratory,  where  the  Artificial  Intelligence  Project  is  housed,  and  where 
the  author  works.  The  laboratory  building  crosses  the  picture  in  the  middle 
distance.  Behind  it  is  a  row  of  eucalyptus  trees,  through  which  the  skyline, 
a  ridge  about  five  miles  away,  can  be  seen. 

The immediate foreground is a roadway. Between the road and the lab
building is a parking lot filled with a variety of cars. A grassy area is
immediately in front of the building, divided by a concrete walk with steps.
A few cars are parked on this grassy area slightly left of the center of the
images. Due to the slight time difference between the actual taking of the
two photographs, there is a man walking down the steps in the first picture
who does not appear in the second picture. Also, one parking space has been
emptied and another filled in that time interval.

Lighting is from overhead, with the sun slightly in front of the
camera. Thus the near faces of the building, cars, and even the trees are in
shadow. Some reflection occurs from automobile windshields. Since the day
was slightly smoggy, shadows are slightly diffuse and the distant hills are
hardly visible.

The original photographs were 35 mm color slides. The cameras were
hand held in the field; the distance between the two camera positions is
approximately ten meters. The slides were photographed under standard red,
green, and blue filters to produce black and white negatives, which were then
digitized commercially. The resulting 1200 by 800 pixels of data were
windowed to remove a light leak at the right end of the building and
spatially reduced by a factor of five to produce 150 by 150 images.
Illustration A-2 shows the intensity pictures, made by averaging the red,
green, and blue component pictures.

Colors are predominantly blues and yellows. The shadows on the trees
and building override their true colors with a blue tinge. Most of the cars
in the lot are blue, grey, or white; the station wagon in the first row is
red, but again its color is largely masked by the shadowed near side and the
glare off of the hood. The grassy areas are yellow, with some green along
the walkway.

The  lab  pictures  have  been  used  as  both  color  and  intensity  images. 
In  spite  of  the  wide  separation  between  the  cameras,  all  of  the  objects  are 
far  enough  away  to  avoid  problems  with  perspective  distortion.  However,  the 
presence  of  many  man-made  objects  of  uniform  color  and  having  linear  edges 
makes  this  pair  of  pictures  interesting. 


THE  CANYON  PICTURES 


The third pair of pictures, taken from the rim in Bryce Canyon
National Park, shows one of the park's sandstone formations. In the middle
distance are pinnacles and a narrow spine of eroded sandstone running across
the picture. In the far distance is the other side of the canyon with sparse
evergreen trees clinging to it. Lighting is from the right, casting many of
the faces of the pinnacles into shadow.


The original photographs were 35 mm color slides. The cameras were
hand held in the field; the distance between the two camera positions is
approximately fifty meters. The slides were digitized by use of a special
illuminating attachment to one of the A. I. Lab Hand-Eye television cameras.
The pictures in Illustration A-3 are full scale 180 by 120 windows out of the
middle of the originals, to pick up the most challenging features.


This was a particularly interesting pair of pictures from an
artificial intelligence point of view. Using only the intensity information,
humans were, for the most part, unable to pick out the exact location of the
edges of the mid-ground formations. Color information helped, since the
background formations are yellow-orange, while the midground ones are more
red; also the green of the trees helped to distinguish them from dark shadows
in crevasses. Still, some edges required looking at both of the color
pictures before people could locate them exactly. The challenge was to see
whether the matching and depth discontinuity algorithms could do well with
only stereo intensity information.


THE YARD PICTURES


The fourth pair of pictures is of a portion of the area around the
author's home. Part of the cinder-block garage wall is visible at the right
side of the picture, with ivy growing up it. A board fence extends from the
corner of the building across the picture. Pyracantha bushes obscure the
fence at the left edge of the picture. The fence is broken in the middle of
the picture by a wooden gate, which is standing open, away from the camera.
There is a small rug hanging on the gate, a pair of gloves on the fence, and
a jar sitting on the gate latch post.

Two large firewood logs are in the foreground in the middle of the
picture, one lying on its side and one standing on end. The one on end has
an ax handle lying across it; the ax head is embedded in the top of the log.
An automobile wheel lies between the upright log and the ivy. There is a
plastic dish-pan upside down under the pyracantha nearest the gate. The roof
line of another building is visible just over the fence. Tree tops form the
background of the picture.

The original photographs were 35 mm black and white negatives. The
cameras were hand held in the field; the distance between the two camera
positions is approximately one meter. The negatives were digitized
commercially, and the 800 by 1200 pixels of data were windowed slightly and
spatially reduced by a factor of five to produce 220 by 160 images.
Illustration A-4 shows the resulting images.

Again, parts of this pair violate the hypothesis of no perspective
distortion. Specifically, the foreground logs and the ax handle show
significant differences in orientation in the two images. However, the other
parts of the images make excellent material for matching. As with the canyon
pictures, humans have some difficulty separating the background from the
foreground in this pair, particularly where the pyracantha bushes blend into
the background trees.

Appendix B


BASIC CORRELATION TOOLS

For the purposes of this thesis, the measure of match between two
areas will be normalized cross-correlation. It will ordinarily be calculated
between areas that are rectangular in shape and have odd dimensions, i.e.,
(2m+1) x (2n+1) windows. This makes it easy to characterize the area by its
center point.

In this and the following appendices, the following mathematical
conventions are used.

Vectors are indicated by an arrow over the capital letter which names the
vector, e.g. $\vec{A}$ is the vector named "A". Unit vectors are indicated
by a hat over the lower case letter which names the vector, e.g. $\hat{a}$
is the unit vector named "a". Specific 2- and 3-dimensional vectors
may be written out (x,y) or (x,y,z), respectively.

Vector dot product is indicated by a raised dot, $\cdot$.

The norm or length of a vector $\vec{A}$ is denoted by $|\vec{A}|$.

The mean of a vector quantity $\vec{A}$ is denoted by $\bar{A}$.

Exponentials of the quantity e (the base of natural logarithms) are
represented by using the function EXP.


AREA  CORRELATION 


The basic measure of match is the "correlation coefficient" discussed
in most statistics books. (For example, see Freund [1962].) In
our notation, this correlation is

\[
\mathrm{COR} =
\frac{\sum_i (X_i - \bar{X})\,(Y_i - \bar{Y})}
     {\sqrt{\sum_i (X_i - \bar{X})^2 \; \sum_i (Y_i - \bar{Y})^2}}
\tag{B-a}
\]

For our purposes, $X_i$ and $Y_i$ are intensity values at corresponding pixels
within the rectangular windows. This is implemented as

\[
\mathrm{COR} = C(X, Ix, Jx;\ Y, Iy, Jy) =
\frac{\sum_{-m \le i \le m} \sum_{-n \le j \le n}
        (X[Ix+i,\,Jx+j] - \bar{X})\,(Y[Iy+i,\,Jy+j] - \bar{Y})}
     {\sqrt{\Bigl(\sum_{-m \le i \le m} \sum_{-n \le j \le n}
                 (X[Ix+i,\,Jx+j] - \bar{X})^2\Bigr)
            \Bigl(\sum_{-m \le i \le m} \sum_{-n \le j \le n}
                 (Y[Iy+i,\,Jy+j] - \bar{Y})^2\Bigr)}}
\tag{B-b}
\]

where (Ix,Jx) and (Iy,Jy) are the centers of the target and candidate areas,
respectively. Since this is rather cumbersome to write, we will abbreviate
it with the notation of Equation B-a, leaving the center points and the fact
that i ranges in two dimensions over the (2n+1)*(2m+1) pixels in the
surrounding windows implicit. The means, of course, are calculated over this
same area.

This is our ordinary form of correlation. It is primarily useful in
an application where each image consists of one (black-and-white) picture.
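
The windowed form of this correlation can be sketched in a few lines of
Python. This is an illustrative sketch only, not the author's SAIL
implementation: the names `window`, `cor`, and `area_cor` are hypothetical,
images are assumed to be lists of rows, and returning 0 for a window with no
variance is an assumption made here to avoid division by zero.

```python
import math

def window(img, ci, cj, m, n):
    """Flatten the (2m+1) x (2n+1) window of img centered at (ci, cj)."""
    return [img[ci + i][cj + j]
            for i in range(-m, m + 1)
            for j in range(-n, n + 1)]

def cor(xs, ys):
    """Normalized cross-correlation of two equal-length samples."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                    sum((y - mean_y) ** 2 for y in ys))
    # Assumption: a window with zero variance is given correlation 0.
    return num / den if den else 0.0

def area_cor(X, ix, jx, Y, iy, jy, m, n):
    """Correlation of the window of X at (ix, jx) with the window of Y at (iy, jy)."""
    return cor(window(X, ix, jx, m, n), window(Y, iy, jy, m, n))

# Toy data: Y is X shifted by one column, with gain 2 and offset 10.
X = [[(3 * i + 7 * j * j) % 11 for j in range(6)] for i in range(5)]
Y = [[2 * X[i][j + 1] + 10 for j in range(5)] for i in range(5)]

# The two 3 x 3 windows below cover the same underlying values.
c = area_cor(X, 2, 3, Y, 2, 2, 1, 1)
print(round(c, 6))
```

Because the correlation is normalized, the gain and offset applied to the toy
data do not affect the result: the matching windows correlate at 1, up to
rounding.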


COLOR  CORRELATION 


In the case of color images there are three pictures involved. Since
the color images we currently are working with were obtained by digitizing
three black and white pictures which resulted from photographing an ordinary
color slide under red, green, and blue filters, respectively, we shall
consider the components of our color pictures to be red, green, and blue,
which we will symbolize as R, G, and B.

It is somewhat more convenient to think of a color picture P as one
array of vector-valued points (PR,PG,PB) instead of three separate arrays of
scalar-valued points PR, PG, and PB. This suggests regarding the text-book
version of normalized cross-correlation, Equation (B-a), as the
one-dimensional case of a vector function

\[
\mathrm{VCOR} =
\frac{\sum_i (\vec{X}_i - \overline{\vec{X}}) \cdot (\vec{Y}_i - \overline{\vec{Y}})}
     {\sqrt{\sum_i |\vec{X}_i - \overline{\vec{X}}|^2 \;
            \sum_i |\vec{Y}_i - \overline{\vec{Y}}|^2}}
\]

Considering only the numerator of VCOR, and letting $\vec{X}_i$ be
$(XR,XG,XB)_i$ and $\vec{Y}_i$ be $(YR,YG,YB)_i$, we have

\begin{align*}
\sum_i (\vec{X}_i - \overline{\vec{X}}) \cdot (\vec{Y}_i - \overline{\vec{Y}})
&= \sum_i \bigl( (XR,XG,XB)_i - (\overline{XR},\overline{XG},\overline{XB}) \bigr)
   \cdot \bigl( (YR,YG,YB)_i - (\overline{YR},\overline{YG},\overline{YB}) \bigr) \\
&= \sum_i ( XR_i-\overline{XR},\ XG_i-\overline{XG},\ XB_i-\overline{XB} )
   \cdot ( YR_i-\overline{YR},\ YG_i-\overline{YG},\ YB_i-\overline{YB} ) \\
&= \sum_i \bigl[ (XR_i-\overline{XR})(YR_i-\overline{YR})
   + (XG_i-\overline{XG})(YG_i-\overline{YG})
   + (XB_i-\overline{XB})(YB_i-\overline{YB}) \bigr]
\end{align*}

If we notice that all three terms within this sum are the same in
form and change the definition of i so that it ranges over all components as
well as all elements of components, we get

\[
\sum_i (X_i - \bar{X})\,(Y_i - \bar{Y})
\]

(with $\bar{X}$ and $\bar{Y}$ understood as the means of the component to
which element i belongs), which is the numerator of the formula for ordinary
correlation, Equation (B-a). By similar manipulations, the two terms in the
denominator of VCOR become the same as the two terms in the denominator of
Equation (B-a). This means that color correlation is really a dressed up form
of ordinary correlation. This is convenient, for it means that color
correlation will have all of the properties that ordinary correlation has
been observed to have.
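
A minimal sketch of color correlation, following the derivation above, is
given below. The function name `vcor` and the representation of a pixel as an
(R, G, B) tuple are assumptions made for illustration; note that each color
component keeps its own mean, as in the vector form of the equation.

```python
import math

def vcor(xs, ys):
    """Vector (color) correlation of two equal-length lists of (R, G, B) pixels."""
    count = len(xs)
    mean_x = [sum(p[c] for p in xs) / count for c in range(3)]
    mean_y = [sum(q[c] for q in ys) / count for c in range(3)]
    num = sx = sy = 0.0
    for p, q in zip(xs, ys):
        for c in range(3):            # the sum runs over components as well as pixels
            dx = p[c] - mean_x[c]
            dy = q[c] - mean_y[c]
            num += dx * dy
            sx += dx * dx
            sy += dy * dy
    den = math.sqrt(sx * sy)
    # Assumption: samples with zero variance are given correlation 0.
    return num / den if den else 0.0

# Toy data: the second sample is the first with a common gain of 2 and a
# different offset in each color component.
xs = [(10, 20, 30), (12, 25, 31), (9, 18, 40), (15, 22, 35)]
ys = [(2 * r + 1, 2 * g + 5, 2 * b - 3) for r, g, b in xs]
print(round(vcor(xs, ys), 6))
```

With per-component means, a common gain and separate offsets per component
leave the correlation at 1, up to rounding, just as gain and offset do for
the ordinary single-picture correlation.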


MASKED  CORRELATION 


Obviously correlation need not be restricted to rectangular windows;
the correlation coefficient can be calculated over any sample, regardless of
shape. The only reason for using the rectangular windows was that it is
easier to set up indices to cover a rectangular area than to make indices
trace out an arbitrarily shaped area.

To do correlation over oddly shaped areas, it is first necessary to
implement a way of covering arbitrarily shaped areas easily. Toward this
end, the idea of a correlation mask has been instituted. The mask consists
of a rectangular array M which completely covers the area of interest and is
filled with ones in the area of interest, and zeros elsewhere. In effect, M
is a template for the irregular area.

To use the mask, one sets up indices to cover the rectangle, as in
ordinary correlation, then uses each point of the mask as a predicate to tell
whether or not to include the corresponding pixels in the sums for the
correlation coefficient. Mathematically, this is equivalent to multiplying
each term of the sums by the corresponding term of the mask, that is

\[
\mathrm{MCOR}
= \frac{\sum_{i \mid M_i = 1} (X_i - \bar{X})\,(Y_i - \bar{Y})}
       {\sqrt{\sum_{i \mid M_i = 1} (X_i - \bar{X})^2 \;
              \sum_{i \mid M_i = 1} (Y_i - \bar{Y})^2}}
= \frac{\sum_i M_i\,(X_i - \bar{X})\,(Y_i - \bar{Y})}
       {\sqrt{\sum_i M_i\,(X_i - \bar{X})^2 \;
              \sum_i M_i\,(Y_i - \bar{Y})^2}}
\]

where it is understood that the summations necessary to calculate the means
are done only over the valid part of the mask.
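
The masked form can be sketched as follows. This is a hedged illustration
rather than the author's code: the name `mcor` is hypothetical, the two
windows and the mask are assumed to be already flattened into parallel lists,
and a mask that selects no varying pixels is assumed to give correlation 0.

```python
import math

def mcor(xs, ys, mask):
    """Masked correlation: only pixels whose mask entry is 1 enter the sums."""
    selected = [(x, y) for x, y, keep in zip(xs, ys, mask) if keep == 1]
    if not selected:
        return 0.0                    # assumption: an empty mask gives 0
    # The means are taken over the valid part of the mask only.
    mean_x = sum(x for x, _ in selected) / len(selected)
    mean_y = sum(y for _, y in selected) / len(selected)
    num = sum((x - mean_x) * (y - mean_y) for x, y in selected)
    den = math.sqrt(sum((x - mean_x) ** 2 for x, _ in selected) *
                    sum((y - mean_y) ** 2 for _, y in selected))
    return num / den if den else 0.0

# Toy data: the last pixel is an intruding object that differs between views.
xs = [1, 2, 3, 4, 50]
ys = [3, 5, 7, 9, -20]
print(mcor(xs, ys, [1, 1, 1, 1, 1]))  # intruder included
print(mcor(xs, ys, [1, 1, 1, 1, 0]))  # intruder masked out
```

In this toy case the unmasked correlation is dominated by the intruding
pixel, while masking it out recovers the agreement of the remaining pixels.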

When we attempt to use a zero-one correlation mask to match the top
of the foreground fence post in the barn pictures, we discover that the
masked post correlates best with a piece of the barn wall below and to the
right of the intuitive match. Using the inverse of this correlation
mask (keeping the background and masking out the post) works fine; the trees
and sky match up as one would intuitively expect them to.

What is the difference between these two cases? In the second case,
we are attempting to remove an intruding object and match around it. We
don't care what shape the object is; we merely want to get rid of it.

In the first case, we are attempting to match a specific object with
definite boundaries. In masking out the background, we have also masked out
the fact that the post has edges, turning the post into a piece of wood which
matches the wood of the barn as well as it matches its true counterpart in
the second image.

In order to match specific objects, it is necessary to somehow retain
information about the boundaries of the objects. One way to do this is,
rather than masking out everything outside the areas of interest with zeroes,
to instead weight the correlation so that all of the window is considered,
but the areas of interest influence the correlation more than does their
background.


WEIGHTED CORRELATION


This suggests replacing the zero-one correlation mask M by a weight mask W,
yielding

\[
\mathrm{WCOR} =
\frac{\sum_i W_i\,(X_i - \bar{X})\,(Y_i - \bar{Y})}
     {\sqrt{\sum_i W_i\,(X_i - \bar{X})^2 \;
            \sum_i W_i\,(Y_i - \bar{Y})^2}}
\]

This necessitates changing the nature of the mean used from the
ordinary averaging mean to a weighted mean. Thus, instead of using

\[
\bar{X} = \frac{\sum_i X_i}{\sum_i 1}
\]

we want to use

\[
\bar{X} = \frac{\sum_i W_i\,X_i}{\sum_i W_i}
\]

Indeed, when the correlation mask for the foreground post is filled with ones
and sevens, instead of zeroes and ones, the algorithmic match is the same as
the intuitive match: post matches post.

In addition to being used in most template matching, WCOR can also be
used to place arbitrary weights in a window, as shown in Illustration B-1.
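
A sketch of weighted correlation with weighted means is given below. The
function name `wcor` and the flattened-list representation are assumptions
made for illustration; the weights of seven and one echo the values mentioned
above, but the data are invented.

```python
import math

def wcor(xs, ys, weights):
    """Weighted correlation using weighted means of both samples."""
    total = sum(weights)
    if total == 0:
        return 0.0                    # assumption: all-zero weights give 0
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    num = sum(w * (x - mean_x) * (y - mean_y)
              for w, x, y in zip(weights, xs, ys))
    den = math.sqrt(sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs)) *
                    sum(w * (y - mean_y) ** 2 for w, y in zip(weights, ys)))
    return num / den if den else 0.0

xs = [1, 2, 3, 4, 50]
ys = [3, 5, 7, 9, -20]
print(wcor(xs, ys, [1, 1, 1, 1, 1]))  # uniform weights: ordinary correlation
print(wcor(xs, ys, [7, 7, 7, 7, 1]))  # area of interest weighted more heavily
print(wcor(xs, ys, [1, 1, 1, 1, 0]))  # zero-one weights: masked correlation
```

Uniform weights reduce this to ordinary correlation and zero-one weights to
masked correlation, so the weighted form covers both earlier cases.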


POINTER  CORRELATION 


Most correlation is implemented in a very orderly fashion. A pointer
starts at the upper left-hand corner of the rectangle to be covered and moves
across the row of pixels. When it gets to the right edge of the rectangle
it returns to the left edge in the next row. The reason for this is
efficiency.

No matter whether the pixels are placed one per word (or fixed-length
byte) or are packed and unpacked by special byte handling instructions, the
most efficient way to access an area of bytes is to have a pointer which one
increments. The efficiency consideration pretty well constrains one to
scanning lines of the picture.

Correlation does not demand this. All that correlation requires is
to be given pairs of points, one out of each picture, which are then
incorporated into the sums. Another way to implement correlation is to first
set up a table of pointers, then simply run a secondary pointer down the
table of pointers. Implemented in this fashion, correlation becomes

\[
\mathrm{PCOR} =
\frac{\sum_i (X[P_i] - \bar{X})\,(Y[P_i] - \bar{Y})}
     {\sqrt{\sum_i (X[P_i] - \bar{X})^2 \;
            \sum_i (Y[P_i] - \bar{Y})^2}}
\]

where i now ranges over the table of pointers, and the means are calculated
from this same set of pointed X and Y.

Once one has accepted the extra cost caused by looking up the pointer
before one can use it, other benefits become obvious. For instance, we are
no longer tied to rectangular areas. Once the pointers are set up, it is
immaterial what shape they cover: hexagons, circles, trapezoids, and even
grossly irregular shapes are all the same to this correlation. This does
away with the need to cover a rectangular template which tells whether or not
to include a given point in the correlation. Since as much as half of a
template is not used most of the time, not having to consider those points at
all could result in a vast speedup of correlating irregular areas.

This form of correlation also makes it possible to correlate in
pictures with known distortions. The pointers are simply set up to take the
distortion mapping into account. For instance, if one picture is known to
have a scale-factor difference from the other, the target area can be covered
by pointers at unit spacing while the candidate area is covered by pointers
determined by the scale factor. Any other known distortion can be handled
similarly.

One can even access the pixels in an area randomly, say to implement
a Barnea and Silverman type sampling algorithm. All that is needed are two
parallel tables of pointers generated in some pseudo-random order.
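
The pointer-table idea, including the scale-factor case, can be sketched as
follows. The name `pcor` is hypothetical, and "pointers" are modeled here as
(row, column) index pairs held in two parallel tables, which is an assumption
about representation rather than the author's implementation.

```python
import math

def pcor(X, px, Y, py):
    """Correlation over two parallel tables of (row, col) pointers."""
    xs = [X[i][j] for i, j in px]
    ys = [Y[i][j] for i, j in py]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs) *
                    sum((y - mean_y) ** 2 for y in ys))
    # Assumption: samples with zero variance are given correlation 0.
    return num / den if den else 0.0

# Toy data: Y is X enlarged by a scale factor of 2, with gain 3 and offset 1.
X = [[(5 * i + 3 * j) % 7 for j in range(4)] for i in range(4)]
Y = [[3 * X[i // 2][j // 2] + 1 for j in range(8)] for i in range(8)]

# Target pointers at unit spacing; candidate pointers at spacing 2.
px = [(i, j) for i in range(1, 4) for j in range(1, 4)]
py = [(2 * i, 2 * j) for i, j in px]
print(round(pcor(X, px, Y, py), 6))
```

The same function handles irregular shapes or pseudo-random sampling simply
by changing how the two tables are generated; the correlation code itself
does not change.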


AUTOCORRELATION 


In signal processing, the autocorrelation function is an important
tool for characterizing the frequency content of a signal. The fact that,
for suitably constrained signals, the Fourier transform of the
autocorrelation function is the power-density spectrum of the signal explains
why an examination of the autocorrelation peak can give such a good
indication of the presence of extremely high or low frequency components in
the image [Lathi, 1968]. Our main interest in autocorrelation, however, is
not as a tool for characterizing the image data, but as a tool for
determining what correlation values might be expected for a given target
area.

Let A(Ix,Jx;di,dj) denote the correlation between an area of picture
X centered at (Ix,Jx) and an area of picture X centered at (Ix+di,Jx+dj). In
the notation of Equation B-b, this is expressed as

\[
A(Ix,Jx;\,di,dj) = C(X, Ix, Jx;\ X, Ix+di, Jx+dj)
\]

If the two images were identical except for a constant translation
(Ti,Tj), gain A, and offset B, i.e., Y[i,j] = A * X[i+Ti,j+Tj] + B for all
(i,j) in the images, then the correlation and autocorrelation surfaces would
be identical. For a pair of areas centered at (Ix,Jx) and (Iy,Jy)
which are an intuitive match, we would have

\[
A(Ix,Jx;\,di,dj) = C(X, Ix, Jx;\ Y, Iy+di, Jy+dj)
\tag{B-c}
\]

for all (di,dj) within the two images.

This is rarely the case, since most data of interest will have more
meaningful changes between the two images than a constant translation, gain,
and offset. However, when we assumed that there is little or no distortion
over windows of the size being correlated, we effectively postulated that the
changes between the two images are small locally. Consequently, while
Equation B-c usually will not hold for all (di,dj) within the two images, it
might be expected to hold within the immediate vicinity of the matching area
centers.

Now, we know that correlation of (Ix,Jx) with areas centered at
points around (Iy,Jy) yields values not greater than the correlation with
(Iy,Jy), i.e., for 0 < |(di,dj)| < 2,

\[
C(X, Ix, Jx;\ Y, Iy+di, Jy+dj) \le C(X, Ix, Jx;\ Y, Iy, Jy)
\tag{B-d}
\]

for it was the fact that we were at a correlation peak which helped to
determine (Iy,Jy) to be the match. Substituting Equation B-c into the left
side of Equation B-d, we have for 0 < |(di,dj)| < 2

\[
A(Ix,Jx;\,di,dj) \le C(X, Ix, Jx;\ Y, Iy, Jy)
\]

i.e., that the match correlation is not less than any of the immediate
neighboring A(Ix,Jx;di,dj). Consequently, we would expect that the
correlation is not less than the maximum of these autocorrelations, that is,

\[
C(X, Ix, Jx;\ Y, Iy, Jy) \ge \max_{0 < |(di,dj)| < 2} A(Ix,Jx;\,di,dj)
\]

Experimentation has shown that the match correlation meets this
criterion for some 90% of the good matches found. In addition, the
correlation at false matches fails to meet this criterion for about 95% of
the cases examined.
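
The floating threshold described above can be sketched in a few lines. This
is only an illustration under stated assumptions, not the program used in
this work: it assumes the standard area correlation C is a normalized
cross-correlation over square windows, and the names window, ncc,
autocorr_threshold, and passes_threshold are hypothetical.

```python
import numpy as np

def window(img, i, j, half):
    """Square window of side 2*half+1 centered at (i, j)."""
    return img[i - half:i + half + 1, j - half:j + half + 1].astype(float)

def ncc(a, b):
    """Normalized cross-correlation (assumed stand-in for C)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return 0.0 if denom == 0.0 else float((a * b).sum() / denom)

def autocorr_threshold(x, ix, jx, half):
    """MAX of A(Ix,Jx;di,dj) over the 8 neighbors, 0<|(di,dj)|<2."""
    target = window(x, ix, jx, half)
    return max(
        ncc(target, window(x, ix + di, jx + dj, half))
        for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)
    )

def passes_threshold(x, y, ix, jx, iy, jy, half):
    """True if the match correlation is not less than the threshold."""
    match = ncc(window(x, ix, jx, half), window(y, iy, jy, half))
    return match >= autocorr_threshold(x, ix, jx, half)
```

Note that this form makes the 8 passes over the target window mentioned
in the text, one per neighboring autocorrelation.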


A related measure, an autocorrelation calculated between the target
area and a copy of itself created by displacing different parts of the
correlation window in different directions, as shown in Illustration B-2, also
works quite well as a floating threshold. This measure has the advantage
that it can be calculated in one pass over the data, rather than the 8 passes
required to calculate the 8 neighboring autocorrelations for measures based
on A(Ix,Jx;di,dj) for 0<|(di,dj)|<2. Effectively, this threshold measures
how well the target area correlates with a slightly distorted version of
itself. A large number of other distortion patterns can also be used.

This autocorrelation threshold passes about 98% of the good matches
found, and rejects approximately 99% of the false matches encountered. It is
this threshold which is most commonly used in region growing, both because of
its ease of calculation and its accuracy of prediction. Unfortunately, we do
not know why it seems to function better.

We have discussed autocorrelation in terms of the standard area
correlation. Of course, if another form of correlation is used to determine
the match, then the autocorrelation must use that same type of correlation,
be it masking, weighting, or pointer correlation. Similarly, if the measure
of match used is not correlation at all, but one of the difference measures,
then the "autocorrelation" becomes the "autodifference". Only the formula
for calculating the "automeasure" changes; the mechanics of the process
remain the same.

Autocorrelation  has  a  number  of  uses.  As  we  mentioned  in  Chapter  3, 
the  autocorrelation  peak  can  be  used  to  determine  the  proper  width  of  the 
grid  for  the  search  reduction  technique  of  gridding.  The  value  of  the 
autocorrelation  threshold  can  also  be  included  in  the  vectors  used  in  the 
technique  of  similarity,  since  similar  areas  really  ought  to  have  similar 
autocorrelations.  Autocorrelation  surfaces  help  to  determine  whether  or  not 
a given target area is suitable for matching. Most valuable, perhaps, is
its use in deciding whether or not a match is good, either for isolated
matches or for region growing.


THE MATCH SUBROUTINE


Another basic part of correlation usage is the local strategy used to
search for a matching point in an area thought to be promising. Most of our
algorithms for determining whether or not an area is promising are based on
whether the center of the area looks promising. Therefore, it makes sense,
when considering the area in detail, to look first at points near the center
and gradually work out toward the edges of the area. We have already
observed that the correlation need not be calculated at every point of an
area; calculating the correlation over a grid is adequate.

Based on these observations, the following local search algorithm was
devised [Quam, 1971] to seek the highest correlation within a square area.
The algorithm is implemented as a subroutine called MATCH, which takes four
arguments. The first two are the coordinates of the center point of the area
to be searched; when the routine returns, these variables contain the
coordinates of the point found to have the highest correlation. The third
argument gives the radius to which the search will be carried out; the fourth
tells what value of correlation is to be the threshold for search
termination.

As shown by Illustration B-3, the search starts at the center point
of the candidate area, then spirals outward in the pattern indicated. At
each point marked with a *, the correlation is calculated with the target
area. The algorithm keeps track of the point having the highest correlation
found so far. Should the correlation exceed the preset threshold or the
search radius be reached, the search stops spiraling and goes into
hill-climbing mode at the point which had the highest correlation.

In hill-climbing mode, the algorithm examines the correlation at each
of the eight points immediately surrounding the present point, and moves to
the point which has the highest correlation. This loop is repeated until
there is no higher point to move to, i.e. the summit of the hill has been
reached.


The  grid  for  the  spiral  is  determined  by  a  table  within  the  routine. 
Originally,  Quam  set  the  table  so  that  the  algorithm  used  a  grid  spacing  of  2 
for the first loop, 3 for the next 3 loops, 4 for 2 loops, then 5. This
author has implemented a version which uses a constant grid spacing for all
loops, which is communicated by a global variable MGRID. This parameter is
set by a routine which examines the autocorrelation peak, as explained in
Chapter  3. 
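
The two phases of MATCH (grid search outward from the center, then hill
climbing) can be sketched as follows. This is a minimal illustration under
stated assumptions, not the original subroutine: corr is a hypothetical
callable returning the correlation of the target area with the candidate
area centered at a given point, the grid spacing is constant (as in the
MGRID version), and the rings are visited ring by ring rather than in the
exact arrow order of Illustration B-3.

```python
def spiral_offsets(radius, grid):
    """Yield grid offsets ring by ring, starting at the center."""
    yield (0, 0)
    for ring in range(grid, radius + 1, grid):
        for di in range(-ring, ring + 1, grid):
            for dj in range(-ring, ring + 1, grid):
                if max(abs(di), abs(dj)) == ring:
                    yield (di, dj)

def match(corr, ci, cj, radius, threshold, grid=2):
    """Return ((i, j), correlation) of the peak found near (ci, cj)."""
    best, best_pt = float("-inf"), (ci, cj)
    # Phase 1: search outward over the grid until the threshold is
    # exceeded or the search radius is reached.
    for di, dj in spiral_offsets(radius, grid):
        c = corr(ci + di, cj + dj)
        if c > best:
            best, best_pt = c, (ci + di, cj + dj)
        if c > threshold:
            break
    # Phase 2: hill-climb over the eight immediate neighbors.
    while True:
        i, j = best_pt
        c, pt = max(
            (corr(i + di, j + dj), (i + di, j + dj))
            for di in (-1, 0, 1) for dj in (-1, 0, 1) if (di, dj) != (0, 0)
        )
        if c <= best:
            return best_pt, best
        best, best_pt = c, pt
```

In this sketch the result is returned rather than written back into the
first two arguments, which is a convenience of the illustration only.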


THE LMATCH SUBROUTINE

MATCH is a two-dimensional search strategy. When the area of
interest has been confined to a line, however, we need a one-dimensional
version, LMATCH. LMATCH has five arguments. The first four are the same as
for MATCH, except that the center point is expressed as a real point lying on
the given line. The fifth argument is the slope of the given line.

The  search  starts  by  calculating  the  correlation  at  the  picture  point 
closest  to  the  given  center  point.  It  then  moves  n  units  up  the  line  from 
the  given  starting  point  and  calculates  the  correlation  at  the  closest 
picture  point,  then  repeats  this  n  units  down  the  line  from  the  starting 
point, then 2n units up the line from the starting point, then 2n down, then
3n up, then 3n down, etc. Again, n is determined from the autocorrelation
and communicated by MGRID.

Like MATCH, LMATCH keeps track of the best correlation found so far
and exits from this "ping-pong" spiral when it reaches the radius or finds a
correlation above the threshold. From the point having the best correlation,
it goes into "inchworm" climbing mode, moving along the line in the uphill
direction until it can't go up any more. Then it goes into the
two-dimensional hill climb of MATCH, just in case the line was a little off
and the matching point is not exactly on the line.
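
The order in which the "ping-pong" search visits points can be sketched as
below. This is an illustration only: the function name is hypothetical, the
slope is assumed to mean dj/di (so a line parallel to the J-axis cannot be
expressed), and "n units up the line" is assumed to mean distance measured
along the line.

```python
import math

def pingpong_points(ci, cj, slope, radius, n):
    """Nearest picture points at 0, +n, -n, +2n, -2n, ... along the line."""
    norm = math.hypot(1.0, slope)
    ui, uj = 1.0 / norm, slope / norm      # unit vector along the line
    points = [(round(ci), round(cj))]
    k = 1
    while k * n <= radius:
        for sign in (1, -1):               # up the line, then down
            step = sign * k * n
            points.append((round(ci + step * ui), round(cj + step * uj)))
        k += 1
    return points
```

The correlation would then be evaluated at each of these points in turn,
stopping early if the threshold is exceeded.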


INTERPOLATION 


It should be noted that all of the above techniques use correlation
over areas centered on integer points in the picture. In practice, however,
the proper match (in the sense of the candidate area which represents the
same piece of the scene as the target area) for a given target will be an
area centered on a point in Picture Y with non-integer co-ordinates. Since
the only correlation values which are available are those at integer points,
some form of interpolation is necessary whenever high precision is desired.

Therefore, the final operation on a match destined to be used for
depth, camera model, or world model determination is an interpolation. We
would like to fit a function of the form

EXP( - (A*DI^2 + B*DI + C*DI*DJ + D*DJ^2 + E*DJ + F) )  (B-e)

to the correlation values C( X,Ix,Jx; Y,Iy+DI,Jy+DJ ) for (DI,DJ) within some
radius of the matching center points. To do this, we fit the polynomial
A*DI^2 + B*DI + C*DI*DJ + D*DJ^2 + E*DJ + F to the negative of the logarithm
of the correlation values. Solving the fitted function for a maximum gives
the interpolated non-integer center point for the matching area in Picture Y.
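
As an illustration of this fit, the sketch below uses an ordinary
least-squares solve from NumPy; the fitting method actually used in this work
is not stated here, so that choice is an assumption, as is the function name
fit_peak. It fits the six coefficients to the negative logarithm of the
correlation values (which must be positive) and then solves for the point
where both partial derivatives of the polynomial vanish.

```python
import numpy as np

def fit_peak(offsets, values):
    """Fit EXP(-(A*DI^2 + B*DI + C*DI*DJ + D*DJ^2 + E*DJ + F)).

    offsets: sequence of (DI, DJ) pairs; values: positive correlations.
    Returns the coefficients (A..F) and the interpolated peak (DI, DJ).
    """
    di, dj = np.asarray(offsets, float).T
    design = np.column_stack(
        [di * di, di, di * dj, dj * dj, dj, np.ones_like(di)]
    )
    target = -np.log(np.asarray(values, float))
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    a, b, c, d, e, _f = coeffs
    # Stationary point: 2A*DI + C*DJ + B = 0 and C*DI + 2D*DJ + E = 0.
    peak = np.linalg.solve([[2 * a, c], [c, 2 * d]], [-b, -e])
    return coeffs, peak
```

For example, the nine correlation values at (DI,DJ) in {-1,0,1} x {-1,0,1}
around the integer match are enough to determine the six coefficients.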

When a model of the autocorrelation surface is desired, this same
exponential fitting process is applied. Rather than being used to
interpolate the autocorrelation, this exponential surface is used as an
approximation to the autocorrelation peak. Examination of the coefficients
of Equation B-e provides an easy way to determine the width of the peak,
whether for calculation of the grid spacing or determination of the
suitability of the area for matching.


Exponentially weighted window.


Illustration B-1. Windows of weights, such as might be used when minor
distortions are present.

Illustration B-2. A sketch showing the manner in which a window could be
distorted to determine an autocorrelation threshold over it. Pixels within
the four areas spaced about the center point C as shown in the left drawing
are correlated with pixels in the areas spaced about C as shown in the right
drawing.

Illustration B-3. A representation of the search pattern for the subroutine
MATCH. The algorithm begins at the center point and spirals outward
following the arrows and calculating correlations at the points marked *. It
stops spiralling when it finds a sufficiently high correlation or reaches the
radius of the spiral.


Appendix  C 


CAMERA MODEL CALCULATIONS


For our purposes, a camera model consists of seven numbers which
specify the principal distances of the two cameras and the position and
orientation of the second camera with respect to the first. (The principal
distance of a camera is the distance between its image plane and its
principal point along its principal axis, as shown in Illustration C-1.) This
appendix contains the mathematics used in deriving and utilizing camera
models.


DERIVATION  OF  CAMERA  MODEL  EQUATIONS 


We begin by arbitrarily placing a left-handed 3-dimensional
co-ordinate system on the world in the following manner. The origin of this
co-ordinate system is the principal point of the first camera. The principal
axis of the camera becomes the z-axis of the world. The scale of the
co-ordinate system is such that one unit equals the width of one pixel on the
image plane. (See Illustration C-1)

Mathematically, the principal point has position (0,0,0); a point on
the principal axis is represented by d*(0,0,1), and the image plane has the
equation z = F1. The I- and J-axes of the first camera plane are parallel to
the X- and Y-axes of the reference co-ordinate system, respectively, and in
the plane z = F1, that is,

I = (0,0,F1) + Ix*(1,0,0) and J = (0,0,F1) + Jx*(0,1,0)

The principal point of the second camera is the point in 3-space
described by the baseline distance D, which is the distance between the
principal points of the two cameras, and by two angles, α1 and α2. When the
first camera has been panned by α1 radians, then tilted by α2 radians, its
principal axis will point down the baseline toward the principal point of the
second camera. (See Illustration C-2)

Mathematically, panning is equivalent to a rotation about the Y-axis;
tilting is equivalent to a rotation about the X-axis. The vector O, which
locates the principal point of the second camera, is obtained by taking the
vector (0,0,1), pre-multiplying it by the matrix Rx(α2), representing a
rotation of α2 about the X-axis, pre-multiplying this result by the matrix
Ry(α1), representing a rotation of α1 about the Y-axis, and finally
multiplying this quantity by the scalar D, i.e.

O = D*( Ry(α1)*( Rx(α2)*(0,0,1) ) )

  = D * Ry(α1) * Rx(α2) * (0,0,1)

        |  COS(α1)  0  SIN(α1) |   | 1      0        0    |   | 0 |
  = D * |     0     1     0    | * | 0   COS(α2)  SIN(α2) | * | 0 |
        | -SIN(α1)  0  COS(α1) |   | 0  -SIN(α2)  COS(α2) |   | 1 |

where matrix multiplication is denoted by * and done in the usual fashion.
Performing these multiplications, we have

O = D * ( SIN(α1)*COS(α2), SIN(α2), COS(α1)*COS(α2) )  (C-a)

Mathematically, h, the unit vector along the principal axis of the second
camera, is expressed by pre-multiplying the vector (0,0,1) by the
appropriate rotation matrices Ry(θ1) and Rx(θ2), i.e.

h = Ry(θ1)*( Rx(θ2)*(0,0,1) )

  = Ry(θ1) * Rx(θ2) * (0,0,1)

    |  COS(θ1)  0  SIN(θ1) |   | 1      0        0    |   | 0 |
  = |     0     1     0    | * | 0   COS(θ2)  SIN(θ2) | * | 0 |
    | -SIN(θ1)  0  COS(θ1) |   | 0  -SIN(θ2)  COS(θ2) |   | 1 |

Performing these multiplications, we have

h = ( SIN(θ1)*COS(θ2), SIN(θ2), COS(θ1)*COS(θ2) )

The image plane of the second camera is the plane perpendicular to
the principal axis at distance F2 from the principal point. (See
Illustration C-3) According to a standard analytic geometry textbook
[Schwartz, 1960], the plane perpendicular to the vector R and passing through
the point P has the equation

R • ( (x,y,z) - P ) = 0 .

Our image plane is defined to be the plane perpendicular to the
principal axis h and passing through the point O + F2*h. Substituting these
for R and P, respectively, yields

h • ( (x,y,z) - O - F2*h ) = 0

The actual orientation of the second image plane is described by the
angle θ3 through which the first image plane must roll (after having been
panned and tilted to make the principal axes parallel) in order to make the
internal co-ordinate axes of the first camera agree with those of the second
camera. (See Illustration C-4) Let the I- and J-axes of the second camera
plane be represented by the unit vectors f and g, respectively.

The orientations of f and g depend on the pan and tilt angles θ1 and
θ2, as well as the roll angle θ3. Mathematically, a roll is equivalent to a
rotation about the Z-axis. Let Ry(θ1) be the rotation matrix corresponding
to panning by θ1, Rx(θ2) be the rotation matrix corresponding to tilting by
θ2, and Rz(θ3) be the rotation matrix corresponding to rolling by θ3, i.e.

          |  COS(θ1)  0  SIN(θ1) |
Ry(θ1) =  |     0     1     0    |  ,
          | -SIN(θ1)  0  COS(θ1) |

          | 1      0        0    |
Rx(θ2) =  | 0   COS(θ2)  SIN(θ2) |  , and
          | 0  -SIN(θ2)  COS(θ2) |

          |  COS(θ3)  SIN(θ3)  0 |
Rz(θ3) =  | -SIN(θ3)  COS(θ3)  0 |  ,
          |     0        0     1 |

then we can express f and g as

f = Ry(θ1) * Rx(θ2) * Rz(θ3) * (1,0,0)

and

g = Ry(θ1) * Rx(θ2) * Rz(θ3) * (0,1,0)

Multiplying out these matrices in the usual fashion gives

f = ( COS(θ1)*COS(θ3)+SIN(θ1)*SIN(θ2)*SIN(θ3),
      -COS(θ2)*SIN(θ3),
      COS(θ1)*SIN(θ2)*SIN(θ3)-SIN(θ1)*COS(θ3) )

g = ( COS(θ1)*SIN(θ3)-SIN(θ1)*SIN(θ2)*COS(θ3),
      COS(θ2)*COS(θ3),
      -SIN(θ1)*SIN(θ3)-COS(θ1)*SIN(θ2)*COS(θ3) )


The I- and J-axes for the second camera radiate from the point
O + F2*h, so we have

I = O + F2*h + Iy*f and J = O + F2*h + Jy*g .
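
The construction of the second camera's position and axes from the model
angles can be checked numerically. The sketch below is an illustration, not
the program used in this work: it builds the three rotation matrices in the
form given above and returns the position vector of the second principal
point together with the unit vectors along the second camera's I-axis,
J-axis, and principal axis. The function names and the argument order are
assumptions, and all angles are taken in radians.

```python
import numpy as np

def rot_y(t):
    """Pan: rotation about the Y-axis, in the form used in this appendix."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_x(t):
    """Tilt: rotation about the X-axis."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]])

def rot_z(t):
    """Roll: rotation about the Z-axis."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]])

def second_camera_frame(d, a1, a2, t1, t2, t3):
    """Return (o, f, g, h) for baseline d and the five model angles."""
    z = np.array([0.0, 0.0, 1.0])
    o = d * (rot_y(a1) @ rot_x(a2) @ z)      # second principal point
    pan_tilt = rot_y(t1) @ rot_x(t2)
    h = pan_tilt @ z                         # principal axis
    f = pan_tilt @ rot_z(t3) @ np.array([1.0, 0.0, 0.0])   # I-axis
    g = pan_tilt @ rot_z(t3) @ np.array([0.0, 1.0, 0.0])   # J-axis
    return o, f, g, h
```

Because each matrix is a proper rotation, the three returned axis vectors
are mutually perpendicular unit vectors, which is the property the later
derivations rely on.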

To derive a camera model, one takes a set of pairs of points found to
be matches and searches in the space of F1, F2, α1, α2, θ1, θ2, and θ3 for
the values of these parameters which best account for these point-pairs.
Actual determination of the model is done by least-squares minimization of
the measure of camera model error presented below. As in most least-squares
techniques, the number of point-pairs must be greater than N/2, where N is
the number of parameters being sought, and the points should be independent,
i.e. no three co-linear in the image planes and no four representing
co-planar points in 3-space. In practice, the number of reliable pairs
available, p, should satisfy p >= 2N, or in our case of N=7, p >= 14. The
program which derives the camera model sets an upper limit of 108 on the
number of pairs which can be used.


CAMERA  MODEL  ERROR  MEASURE 


There are many error measures possible. The one presented here is
the average of the error in match for each of the point-pairs, calculated in
the image plane. To calculate the error in match for each point-pair, we
first use the first camera principal point to project point x of picture X
into space as a ray, then use the second camera principal point for the
hypothesized camera model to project this ray into the second image plane as
a 2-dimensional line segment, and finally evaluate the distance in the second
image plane between this line segment and the matching point y of picture Y.

Point x of Picture X is the point (Ix,Jx) in the plane of the first
camera, which is the point S = (Ix,Jx,F1) in 3-space. The projection of this
point into space is the ray from the principal point of the first camera,
(0,0,0), through S. In parameterized vector form, this ray is r*S, r>F1.

Using the principal point of the second camera, this ray is projected
into the image plane of the second camera. Perhaps the simplest way to
derive this is to pick two arbitrary points on r*S and project them into the
second camera image plane, then calculate the 2-dimensional line between
them.

To facilitate this, first consider projecting an arbitrary point S in
3-space into the plane P (in our case, the image plane of the second camera)
perpendicular to the vector h (direction of the principal axis of the second
camera) at the point C = O + F2*h (intersection of principal axis and second
image plane), using the point O (principal point of the second camera) as the
principal point of the projection. Clearly, the projected point lies at some
distance t along the line from S to O, so can be described by the position
vector S' = O + t*(S-O).

We would like to express S' in terms of the vectors f and g, which are
ortho-normal and lie in the image plane P. That is, we would like to know I
and J such that

O + F2*h + I*f + J*g = O + t*( S - O ) or
I*f + J*g = t*( S - O ) - F2*h .


Dotting both sides of this vector equation by f gives

( I*f + J*g ) • f = ( t*( S - O ) - F2*h ) • f .

Expanding this, and using the fact that f is a unit vector and is
perpendicular to both h and g, we have

I = t*( S - O ) • f .

Had we dotted both sides of the equation by g, we would have

J = t*( S - O ) • g .

Dotting both sides by h would give

t*( S - O ) • h - F2 = 0 or

t*( S - O ) • h = F2 or

t = F2 / ( ( S - O ) • h ) .

Substituting this expression for t into the expressions for I and J, we have

           ( S - O ) • f                    ( S - O ) • g
I = F2 * ---------------   and   J = F2 * ---------------
           ( S - O ) • h                    ( S - O ) • h

Now we are ready to project two arbitrary points on the ray r*S into
the plane P using the above equations. In the co-ordinates of the second
image plane, the points c0*S and c1*S become the points (x0,y0) and (x1,y1),
respectively, where

                   ( c0*S - O ) • f          ( c0*S - O ) • g
(x0,y0) = ( F2 * ------------------ , F2 * ------------------ ) and
                   ( c0*S - O ) • h          ( c0*S - O ) • h

                   ( c1*S - O ) • f          ( c1*S - O ) • g
(x1,y1) = ( F2 * ------------------ , F2 * ------------------ )
                   ( c1*S - O ) • h          ( c1*S - O ) • h

According to our analytic geometry text [Schwartz, 1960], the
equation for the 2-dimensional line through these two points is

           ( y1 - y0 )
y - y1 = ------------- * ( x - x1 ) or
           ( x1 - x0 )

( y1 - y0 ) * x + ( x0 - x1 ) * y + ( y0*x1 - y1*x0 ) = 0 .

Evaluating ( y1 - y0 ), we have

                     ( c1*S - O ) • g          ( c0*S - O ) • g
( y1 - y0 ) = F2 * ------------------ - F2 * ------------------
                     ( c1*S - O ) • h          ( c0*S - O ) • h

                     (c1*S-O)•g * (c0*S-O)•h - (c0*S-O)•g * (c1*S-O)•h
            = F2 * ---------------------------------------------------
                                (c1*S-O)•h * (c0*S-O)•h

                     (c1 - c0) * ( S•h * O•g - S•g * O•h )
            = F2 * ---------------------------------------
                          (c1*S-O)•h * (c0*S-O)•h

Similar manipulations give

                     (c1 - c0) * ( S•f * O•h - S•h * O•f )
( x0 - x1 ) = F2 * ---------------------------------------
                          (c1*S-O)•h * (c0*S-O)•h

Substituting into ( x1*y0 - y1*x0 ), we have

                           (c1*S-O)•f          (c0*S-O)•g
( x1*y0 - y1*x0 ) = F2 * ------------ * F2 * ------------
                           (c1*S-O)•h          (c0*S-O)•h

                             (c1*S-O)•g          (c0*S-O)•f
                    - F2 * ------------ * F2 * ------------
                             (c1*S-O)•h          (c0*S-O)•h

                             (c1*S-O)•f * (c0*S-O)•g - (c1*S-O)•g * (c0*S-O)•f
                  = F2^2 * ---------------------------------------------------
                                        (c1*S-O)•h * (c0*S-O)•h

                             (c1 - c0) * ( S•g * O•f - S•f * O•g )
                  = F2^2 * ---------------------------------------
                                  (c1*S-O)•h * (c0*S-O)•h

Now, substituting these terms into the equation for the line gives

       (c1 - c0) * ( S•h * O•g - S•g * O•h )
F2 * --------------------------------------- * x +
            (c1*S-O)•h * (c0*S-O)•h

       (c1 - c0) * ( S•f * O•h - S•h * O•f )
F2 * --------------------------------------- * y +
            (c1*S-O)•h * (c0*S-O)•h

         (c1 - c0) * ( S•g * O•f - S•f * O•g )
F2^2 * --------------------------------------- = 0 .
              (c1*S-O)•h * (c0*S-O)•h

Factoring out common terms and dividing by them gives

( S•h * O•g - S•g * O•h ) * x +

( S•f * O•h - S•h * O•f ) * y +

( S•g * O•f - S•f * O•g ) * F2 = 0 ,  (C-b)

the desired line segment in the second image.

The  error  for  that  point-pair  is  the  square  of  the  minimum  distance 
between  this  line  segment  which  corresponds  to  the  point  x  and  the  point  y 
which  matches  the  point  x.  (See  Illustration  C-5) 
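
The line of Equation C-b and the resulting per-pair error can be sketched as
follows. This is an illustration under stated assumptions rather than the
program used in this work: the function names are hypothetical, s is the
3-space point (Ix,Jx,F1), o is the position of the second principal point,
f, g, and h are the unit vectors along the second camera's I-axis, J-axis,
and principal axis, and the line is treated as infinite rather than as a
bounded segment.

```python
import numpy as np

def project(q, o, f, g, h, f2):
    """Project the 3-space point q into second-image co-ordinates."""
    d = np.asarray(q, float) - o
    depth = np.dot(d, h)
    return f2 * np.dot(d, f) / depth, f2 * np.dot(d, g) / depth

def matching_line(s, o, f, g, h, f2):
    """Coefficients (a, b, c) of a*x + b*y + c = 0, as in Equation C-b."""
    sf, sg, sh = np.dot(s, f), np.dot(s, g), np.dot(s, h)
    of, og, oh = np.dot(o, f), np.dot(o, g), np.dot(o, h)
    return sh * og - sg * oh, sf * oh - sh * of, (sg * of - sf * og) * f2

def match_error(line, iy, jy):
    """Squared perpendicular distance from (iy, jy) to the line."""
    a, b, c = line
    return (a * iy + b * jy + c) ** 2 / (a * a + b * b)
```

Any point on the ray r*s, once projected with project, should give an error
of essentially zero, which provides a simple consistency check on a
hypothesized camera model.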


DEPTH  RANGING 


Once one has a camera model, it is relatively trivial to find the
distance from either of the cameras to an object in 3-space represented by a
point-pair. One has the points (Ix,Jx) and (Iy,Jy). The ray from the
principal point of the first camera through (Ix,Jx) is given by the vector
r*(Ix,Jx,F1). The ray from the principal point of the second camera through
(Iy,Jy) is given by

O + s*( F2*h + Iy*f + Jy*g )  (C-c)

Due to minor errors in measurements of camera model parameters or in
interpolation of the matching center point, these two rays may not intersect.
Using the camera model, we can correct for this. We first back-project the
point x into the second image plane, giving us the line of Equation C-b.
Now, instead of the point (Iy,Jy), we decree the point (Iy',Jy') which is on
this line and which is the shortest distance away from (Iy,Jy) to be the true
matching point. This gives us the ray

O + s*( F2*h + Iy'*f + Jy'*g )  (C-d)

in lieu of Equation C-c.

To simplify the notation in the following derivation, let

P = (Ix,Jx,F1)

S = F2*h + Iy'*f + Jy'*g .

We know that the intersection of r * P and O + s * S exists; that is the
definition of (Iy',Jy'). Therefore, we need only solve for the r and s
such that r*P = O + s*S. The two necessary constraints are obtained by
dotting both sides of this equation by P or by S, i.e.

( r * P ) • P = ( O + s * S ) • P

and

( r * P ) • S = ( O + s * S ) • S

These are equivalent to

r * P•P = O•P + s * S•P and

r * P•S = O•S + s * S•S

Solving this system for s gives the distance of the 3-dimensional point
from the second camera, in units of the length of S,

      O•S * P•P - O•P * P•S
s = ------------------------- ,
      S•P * P•S - P•P * S•S

while solving the above system for r gives the distance from the first
camera, in units of the length of P,

      O•S * S•P - O•P * S•S
r = ------------------------- .
      S•P * P•S - P•P * S•S
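
The two closed-form expressions above can be sketched directly in code. This
is an illustration only, with hypothetical names: p is the direction
(Ix,Jx,F1) of the ray from the first camera, s_dir is the direction of the
corrected ray from the second camera, and o is the position of the second
principal point. It assumes the two rays are not parallel, so that the
denominator is non-zero.

```python
import numpy as np

def ray_parameters(p, s_dir, o):
    """Solve r*p = o + s*s_dir for (r, s) by the dot-product constraints."""
    pp, ss, ps = np.dot(p, p), np.dot(s_dir, s_dir), np.dot(p, s_dir)
    op, os_ = np.dot(o, p), np.dot(o, s_dir)
    den = ps * ps - pp * ss        # zero only for parallel rays
    r = (os_ * ps - op * ss) / den
    s = (os_ * pp - op * ps) / den
    return r, s
```

Multiplying r by the length of p, or s by the length of s_dir, converts
these ray parameters into distances in pixel-width units.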


DERIVATION OF TWO MATCHING LINES


With a camera model, it is possible to place two lines, one in each
picture, into correspondence. To see this, consider the two principal points
of the cameras, (0,0,0) and O.

These two points, plus any third point, determine a plane in 3-space.
If we call the third point S, then this plane has as its normal the vector S
x O and goes through the point (0,0,0). Our analytic geometry text tells us
that the equation of a plane with normal R through the point P is

R • ( (x,y,z) - P ) = 0

Therefore the plane determined by (0,0,0), O, and S has the equation

(S x O) • (x,y,z) = 0  (C-e)

Except in a few degenerate cases, this plane intersects both of the
camera image planes. The intersection of this plane with the second image
plane in terms of that plane's coordinate system is given in Equation C-b;
the intersection with the first image plane z = F1 is

(S x O) • (x,y,F1) = 0  (C-f)

Consider also the intersection of the plane of Equation C-e with the
scene. All of the points of this curve lie on the plane of Equation C-e,
obviously; therefore all of the projections of these points onto the second
image plane lie on the line of Equation C-b and all of the projections onto
the first image plane lie on the line of Equation C-f. Thus all of the
points on one line map to points on the other line.

Clearly, S can be almost any point. The most convenient such point
is usually (Ix,Jx,F1), the center point of the target area.


Illustration C-2. The first camera, panned and tilted to point to the
principal point of the second camera.


Illustration C-5. The error for a point pair (Ix,Jx) <-> (Iy,Jy) is the
distance from (Iy,Jy) to the line in the second image which corresponds to
(Ix,Jx).


Appendix  D 


DISTORTION 

Intuitively,  if  the  parts  of  the  two  pictures  which  represent  a  given 
object  differ  in  anything  but  position,  then  the  object  has  been  distorted 
from one view to the other. For our purposes, if, for displacements (di,dj)
within some window and corresponding points (Ix,Jx) and (Iy,Jy) in the two
images, the point (Ix+di,Jx+dj) does not correspond to the point
(Iy+di,Jy+dj), there is distortion over that window.

MATHEMATICAL  DESCRIPTION 

To express this mathematically, we start with two points in 3-space,
R and S. According to the calculations in Appendix C, these points project
to

                        R•i          R•j
R1 = ( R1x, R1y ) = ( ----- *F1 , ----- *F1 ) and
                        R•k          R•k

                        S•i          S•j
S1 = ( S1x, S1y ) = ( ----- *F1 , ----- *F1 )
                        S•k          S•k

in the first image plane, where i, j, and k are the unit vectors along the
X-, Y-, and Z-axes, and

                        ( R-O )•f          ( R-O )•g
R2 = ( R2x, R2y ) = ( ----------- *F2 , ----------- *F2 ) and
                        ( R-O )•h          ( R-O )•h

                        ( S-O )•f          ( S-O )•g
S2 = ( S2x, S2y ) = ( ----------- *F2 , ----------- *F2 )
                        ( S-O )•h          ( S-O )•h

in the second image plane.

Suppose we let S be the reference point, that is, we set
(S1x,S1y) = (Ix,Jx) and (S2x,S2y) = (Iy,Jy). Also, let (R1x-S1x,R1y-S1y) be
the (di,dj) of our intuitive definition. There is distortion if the point
which corresponds to (Ix,Jx) + (di,dj) is not (Iy,Jy) + (di,dj), that is

( R2x, R2y ) ≠ ( S2x, S2y ) + ( R1x-S1x, R1y-S1y ) or
( S2x, S2y ) - ( R2x, R2y ) + ( R1x-S1x, R1y-S1y ) ≠ 0 or
( R1x-S1x-R2x+S2x, R1y-S1y-R2y+S2y ) ≠ 0

We define this last vector to be δ, the distortion vector.

For non-trivial camera models and windows larger than a single point,
it is unlikely that this vector will be exactly zero for all of the (di,dj)
within the window. Consequently, there will almost always be distortion in a
continuous image.

However, we are dealing, not with continuous images, but with images
which are represented by discrete arrays. When, in such an array, the
desired image point falls between elements of the image array, there are two
things which can be done. One can approximate the desired pixel by
interpolating the neighboring array elements, or one can simply use the array
element which is closest to the desired point. In correlating, the latter is
the more common practice.

The vector δ1 = R1 - S1 will ordinarily be such that if its tail is
placed on an integer point of the array, its head will also fall onto an
integer point. If the vector δ2 = R2 - S2 is placed with its tail on the
same integer point as δ1, its head will probably not fall on the head of δ1.
However, if the head of δ2 falls within 1/2 pixel in each co-ordinate of the
head of δ1, we can not really tell the difference in position. Thus, for a
discrete image, we can say that there is no distortion if, for all (di,dj)
within the window, the x- and y-components of δ are both less than 1/2 pixel
in magnitude.
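
The discrete test for distortion can be written out in a few lines. This is
an illustration with hypothetical names: each argument is an (x, y) pair of
image-plane co-ordinates, r1 and s1 being the projections of the two
3-space points in the first image and r2 and s2 their projections in the
second.

```python
def distortion_vector(r1, s1, r2, s2):
    """Distortion vector (R1x-S1x-R2x+S2x, R1y-S1y-R2y+S2y)."""
    return (r1[0] - s1[0] - r2[0] + s2[0],
            r1[1] - s1[1] - r2[1] + s2[1])

def is_undistorted(point_quads, tol=0.5):
    """True if every component stays below tol pixels over the window.

    point_quads: iterable of (r1, s1, r2, s2) tuples, one per
    displacement (di,dj) within the window.
    """
    return all(
        abs(component) < tol
        for quad in point_quads
        for component in distortion_vector(*quad)
    )
```

The default tolerance of 0.5 reflects the half-pixel criterion for images
sampled at the nearest array element.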

LIMITING DISTORTION


Distortion is algebraically a very complicated quantity, for it
depends on thirteen parameters--the pan and tilt angles which describe the
direction to the second camera, the pan, tilt, and roll angles which describe
the orientation of the second camera, the two focal lengths, and three
parameters each to describe the relative 3-space points R and S. Graphing
the distortion as a function of all 13 of these parameters is obviously not
feasible: the graph is impossible to represent physically and excessively
large to tabulate.

If one holds all but one of the parameters constant, one can use the
limitation on the components of the distortion vector to solve for limits on
the last parameter which will guarantee that the distortion is small. This
is possible analytically (see [Fischler, 1971] for a treatment of change of
focal length and for second camera roll angle), but is usually rather messy,
hence not very illuminating. To give a feel for the results for particular
parameters, Illustration D-1 tabulates some of the distortions for the camera
model of the barn pictures with different object positions and orientations.

The barn camera model is almost the standard side-by-side stereo
model. The second camera is placed at 81 degrees of pan from the first
camera and .6 degrees of tilt, that is, to the right of the first camera,
slightly forward of its position, and a little bit higher. Its pointing data
is -3.2 degrees of pan, -1.3 degrees of tilt, and -1.4 degrees of roll, that
is, it is pointed slightly to the left (back toward the first camera), down a
little, with a minor clockwise roll.

The first group of data tabulates the distortion for two points in
the first image plane: (-50,10), which is on the corner of the barn door, and
(-55,5), which is (-5,-5) pixels away. The depths to the corresponding
3-space points are kept equal, that is, the 3-space points are both on a
plane perpendicular to the first camera's principal axis. As this depth
increases from one meter to 100 meters, we observe the resulting changes in
the distortion.

The second group of data uses a different pair of points in the first
image, (10,10) and (17,17)--the point on the skyline where the trees show
somewhat of a notch and a point (7,7) pixels away. For a depth of 10 meters
at (10,10), we vary the depth at (17,17) on either side of 10 meters and
observe the results.

In the third group of data, we have used the same first point (10,10)
and varied the vector to the second point, in effect examining the effect of
varying the window radius from 1 to 10 pixels. For each pair of points, we
have determined (to two decimal places) the depth at which the two 3-space
points would have to lie (both at the same depth) in order to produce
distortion of just less than half of a pixel.

It is hoped that this table will give some feel for the relation
between depth, window size, object orientation, and distortion. Those
wishing to draw specific conclusions about the allowable window size, etc.,
for their own data are advised to program the mathematics of the last section
and produce similar tables for their camera model, since the distortion
vectors will change considerably with changes in the camera model parameters.
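
For readers who wish to produce such a table, the following is a rough sketch
of how it might be programmed. It assumes a simple pinhole camera model with
image co-ordinates measured from the principal point and focal lengths
expressed in pixels; the dictionary layout, the function names, and the
sample parameters are all hypothetical and are not the barn camera model.

```python
def back_project(pixel, depth, focal):
    """3-space point (camera-1 co-ordinates) seen at `pixel` with z = depth."""
    return (pixel[0] * depth / focal, pixel[1] * depth / focal, depth)


def project_into_second(point, position, rotation, focal):
    """Project a camera-1 point into camera 2 (pinhole model)."""
    rel = [point[k] - position[k] for k in range(3)]
    x, y, z = (sum(rotation[row][k] * rel[k] for k in range(3))
               for row in range(3))
    return (focal * x / z, focal * y / z)


def distortion(r_pixel, r_depth, s_pixel, s_depth, camera):
    """D = (R1 - S1) - (R2 - S2) for two picture-1 pixels and their depths."""
    f1, f2 = camera["focal_1"], camera["focal_2"]
    r2 = project_into_second(back_project(r_pixel, r_depth, f1),
                             camera["position"], camera["rotation"], f2)
    s2 = project_into_second(back_project(s_pixel, s_depth, f1),
                             camera["position"], camera["rotation"], f2)
    return (r_pixel[0] - s_pixel[0] - r2[0] + s2[0],
            r_pixel[1] - s_pixel[1] - r2[1] + s2[1])


# Hypothetical side-by-side model: no relative rotation, 0.5 meter baseline.
camera = {
    "focal_1": 1000.0, "focal_2": 1000.0,
    "position": (0.5, 0.0, 0.0),
    "rotation": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
}

for s_depth in (9.9, 10.0, 10.1):
    dx, dy = distortion((10, 10), 10.0, (17, 17), s_depth, camera)
    print(f"s-depth {s_depth:7.3f}  Dx {dx:7.3f}  Dy {dy:7.3f}")
```

With this idealized model, points at equal depth show no distortion at all;
a non-trivial rotation matrix or unequal focal lengths would have to be
supplied to reproduce behavior like that of the barn pictures.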

Under the definition of the last section, depth discontinuities are
distortions. However, such discontinuities are effectively translations,
which our algorithms can handle once they are located, so we will exclude
depth discontinuities from the following discussions.


SMALL  DISTORTIONS 


For known small rotations and scale factor changes it is possible to
choose the correlation window to be distortion-free [Fischler, 1971]. This
is done by calculating at what radius the global distortion causes pixels of
matching windows to get one pixel out of registration, yielding local
distortion. Any window which would fit into a square of this radius will be
distortion-free, at least from this source of distortion.
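
One way to sketch this calculation, assuming the distortion is a pure
rotation by a known angle combined with a known scale factor about the window
center, is given below. Under that assumption a pixel at distance r from the
center is displaced by r * sqrt(1 + s*s - 2*s*cos(angle)), so the limiting
radius follows directly. The function name and the one-pixel default
tolerance are choices made for this example; [Fischler, 1971] should be
consulted for the original treatment.

```python
import math


def max_distortion_free_radius(rotation_deg, scale, tolerance=1.0):
    """Distance from the window center at which a known rotation and scale
    change move a pixel `tolerance` pixels out of registration."""
    theta = math.radians(rotation_deg)
    # Displacement per unit distance from the center: |s * e^(i*theta) - 1|.
    growth = math.sqrt(1.0 + scale * scale - 2.0 * scale * math.cos(theta))
    if growth == 0.0:
        return math.inf          # no rotation and no scale change at all
    return tolerance / growth


print(max_distortion_free_radius(2.0, 1.0))    # 2 degrees of roll only
print(max_distortion_free_radius(0.0, 1.1))    # 10 percent scale change only
```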

For other minor distortions, weighting the correlation window as
shown in Illustration B-1 may also help. (See Appendix B for an explanation
of weighted correlation.) Essentially, this says that we are most interested
in having the center of the correlation window match up, and, while it would
be nice to have the outer parts match up, it should not greatly affect the
correlation if they do not.
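
The effect of weighting can be illustrated with a small sketch. The weight
profile used here (a Gaussian-like falloff from the window center) and the
form of the weighted normalized correlation are assumptions made for the
example; the actual weights of Illustration B-1 and the definitions of
Appendix B may differ.

```python
import math


def center_weights(radius, falloff):
    """Weights of 1 at the window center, decaying with distance from it."""
    span = range(-radius, radius + 1)
    return [[math.exp(-(di * di + dj * dj) / (2.0 * falloff * falloff))
             for dj in span] for di in span]


def weighted_correlation(a, b, weights):
    """Weighted normalized correlation of two equal-sized windows."""
    cells = [(x, y, w)
             for row_a, row_b, row_w in zip(a, b, weights)
             for x, y, w in zip(row_a, row_b, row_w)]
    total = sum(w for _, _, w in cells)
    mean_a = sum(w * x for x, _, w in cells) / total
    mean_b = sum(w * y for _, y, w in cells) / total
    cov = sum(w * (x - mean_a) * (y - mean_b) for x, y, w in cells)
    var_a = sum(w * (x - mean_a) ** 2 for x, _, w in cells)
    var_b = sum(w * (y - mean_b) ** 2 for _, y, w in cells)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0   # a flat window carries no information to correlate
    return cov / math.sqrt(var_a * var_b)


a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
b = [[9, 2, 3], [4, 5, 6], [7, 8, 9]]     # one corner pixel disagrees
uniform = [[1.0] * 3 for _ in range(3)]
print(weighted_correlation(a, b, uniform))
print(weighted_correlation(a, b, center_weights(1, 1.0)))
```

In this example the disagreement lies in a corner of the window, so the
center-weighted correlation is penalized less than the unweighted one.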


GROSS  DISTORTIONS 


But what about large rotations and scale factor changes? Large
distortions will cause matching to fail, since they cause the matching process
to compare points which do not correspond. When enough points do not
correspond, the correlation will fall below the confidence level, and the
areas will fail to match. This necessitated our restrictions on the kind of
pictures we can handle.

However, some of these restrictions can be lifted. The main
technique for this consists of figuring out what caused the problem and
compensating for it. Let us consider some of the causes of large distortions
and see what can be done about them.

Global Rotations and Scale Factor Changes

Global rotations and scale factor changes--those affecting the whole
picture--are caused by a relative roll of one camera about its focal axis and
by differences in the focal lengths of the cameras, respectively. Pairs of
pictures having these distortions are somewhat rare. The human prejudice for
order usually results in multiple photographs of a scene being taken with
identical cameras and lenses, and with both cameras held upright.

There exists the case in which the pictures were taken by a machine,
such as an independent roving vehicle. However, a reasonable design
constraint on such a machine is for it to monitor its orientation with
respect to the world, and report how much roll is present if it must change
angles. One would also expect to know if the focal length of one camera
differed from that of the other. Given this data, it is possible to
decalibrate the pictures, that is, put them into the same orientation and
scale [Quam, 1971].

In the rare case in which gross rotations or scale factor changes are
present but of unknown magnitude, it is still possible to get rid of them.
All that is required is to determine the rotation and scale factor change.

If the locations of enough pairs of points were known, the global
rotation and scale factor change could be computed by least squares
techniques as a part of the camera model (See Appendix C). This requires
collecting several pairs of corresponding points. Since we have assumed that
distortions exist, we cannot use matching techniques, which depend on low
distortion, to find these point pairs.
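
Were such point pairs available, the least squares step itself is short. The
sketch below fits a rotation, scale factor, and translation to the pairs in
closed form; it is a generic similarity fit written for illustration, with
hypothetical names, and is not the camera model solution of Appendix C. The
sign convention for the angle is an assumption: IY = S*(cos*IX + sin*JX) + CI
and JY = S*(-sin*IX + cos*JX) + CJ.

```python
import math


def fit_rotation_scale(first_points, second_points):
    """Least squares scale S, angle, and translation (CI, CJ) mapping
    first-picture points (IX, JX) onto second-picture points (IY, JY)."""
    n = len(first_points)
    mean_ix = sum(p[0] for p in first_points) / n
    mean_jx = sum(p[1] for p in first_points) / n
    mean_iy = sum(p[0] for p in second_points) / n
    mean_jy = sum(p[1] for p in second_points) / n
    num_a = num_b = denom = 0.0
    for (ix, jx), (iy, jy) in zip(first_points, second_points):
        u, v = ix - mean_ix, jx - mean_jx     # centered first-picture point
        p, q = iy - mean_iy, jy - mean_jy     # centered second-picture point
        num_a += u * p + v * q
        num_b += v * p - u * q
        denom += u * u + v * v
    a, b = num_a / denom, num_b / denom       # a = S*cos(angle), b = S*sin(angle)
    ci = mean_iy - (a * mean_ix + b * mean_jx)
    cj = mean_jy - (-b * mean_ix + a * mean_jx)
    return math.hypot(a, b), math.atan2(b, a), ci, cj


# Hypothetical check: synthesize pairs from a known transformation.
s, ang = 1.2, math.radians(10.0)
first = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0), (7.0, -4.0)]
second = [(s * (math.cos(ang) * i + math.sin(ang) * j) + 3.0,
           s * (-math.sin(ang) * i + math.cos(ang) * j) - 2.0)
          for i, j in first]
print(fit_rotation_scale(first, second))
```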

One possible method for discovering these correspondences is to
extend the correlation technique. Instead of merely searching among all
possible translations of the window, Iy = Ix + CI and Jy = Jx + CJ, we could
also search among all possible rotations and scale factor changes:

IY = S*( COS(β)*IX + SIN(β)*JX ) + CI and

JY = S*(-SIN(β)*IX + COS(β)*JX ) + CJ

These new dimensions, S and β, will have to be quantized in order to make the
searches finite. The window size used for the correlation will determine the
maximum quantization step possible without having to worry about distortion.
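
A rough sketch of the mapping and of one possible quantization rule follows.
The rule used here is an assumption made for illustration: a step is taken to
be acceptable if a value lying midway between two quantized values moves the
farthest pixel of a square window by no more than half a pixel, with angle
and scale errors treated independently. All names are hypothetical, and the
window radius is assumed to be at least one pixel.

```python
import math


def map_point(ix, jx, scale, beta, ci, cj):
    """Apply the rotation, scale, and translation equations to (IX, JX)."""
    iy = scale * (math.cos(beta) * ix + math.sin(beta) * jx) + ci
    jy = scale * (-math.sin(beta) * ix + math.cos(beta) * jx) + cj
    return iy, jy


def quantization_steps(window_radius, tolerance=0.5):
    """Coarsest (angle, scale) steps under the assumed half-step criterion."""
    reach = window_radius * math.sqrt(2.0)     # corner of a square window
    angle_step = 4.0 * math.asin(tolerance / (2.0 * reach))
    scale_step = 2.0 * tolerance / reach
    return angle_step, scale_step


def candidate_grid(max_angle, min_scale, max_scale, window_radius):
    """Enumerate the quantized (scale, beta) pairs to be searched."""
    angle_step, scale_step = quantization_steps(window_radius)
    n_angles = math.ceil(max_angle / angle_step)
    n_scales = math.ceil((max_scale - min_scale) / scale_step)
    return [(min_scale + k * scale_step, m * angle_step)
            for k in range(n_scales + 1)
            for m in range(-n_angles, n_angles + 1)]


grid = candidate_grid(math.radians(10.0), 0.8, 1.2, window_radius=5)
print(len(grid), "rotation/scale candidates per translation")
```

Each candidate multiplies the cost of the translation search, which is why
larger windows, with their finer steps, make the search grow so quickly.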

This search in 4 variables will be very long and slow; some method of
shortening it is almost mandatory. The technique of reduction will still
work if the size of the window can be reduced along with the picture.
Gridding will also still work for the translation part of the search, and is
inherent in the quantization of the rotation angle and scale factor.
Similarity will work only if the properties put into the vectors are
invariant under rotation and scale factor change. Camera model searches are
not applicable, since we have no camera model. (If we did, we would know the
relative rotation and focal lengths, and wouldn't be looking for them the
hard way.)

Another method which will give the rotations and scale factor changes
directly was suggested by Lynn Quam. It calls for locating some object or
area lying entirely in both pictures and finding its boundary. This could be
accomplished by flat region growing (see Chapter 5) for an area of low
variance, by conventional edge techniques [Hueckel, 1972], or by more
sophisticated region growing techniques [Yakimovsky, 1973]. The boundary is
then tabulated as distance from the center of mass of the area vs. angle from
some reference direction. It is now possible to correlate the resulting
function tables to find the optimal displacement (i.e. angle) which aligns
them. Once the rotational alignment is determined, the tabulated distances
at corresponding points on the boundaries can be used to derive the scale
factor change, as could the ratio of the perimeters.
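
The alignment step might be sketched as follows, assuming the two boundaries
have already been tabulated as distance vs. angle at the same number of
equally spaced angles. The function names and the synthetic boundary are
hypothetical; a boundary with constant distance (a circle) has no preferred
orientation and is not handled.

```python
import math


def correlation(u, v):
    """Normalized correlation of two equal-length tables."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def align_boundaries(table_1, table_2):
    """Return (rotation in degrees, scale factor) relating two tables of
    boundary distance vs. angle."""
    n = len(table_1)
    best_shift = max(
        range(n),
        key=lambda s: correlation(table_1,
                                  [table_2[(k + s) % n] for k in range(n)]))
    aligned = [table_2[(k + best_shift) % n] for k in range(n)]
    scale = sum(aligned) / sum(table_1)   # ratio of corresponding distances
    return best_shift * 360.0 / n, scale


# Synthetic boundary, tabulated every 5 degrees, then rotated and enlarged.
bins = 72
table_1 = [10 + 3 * math.cos(2 * math.pi * k / bins)
           + math.sin(4 * math.pi * k / bins) for k in range(bins)]
table_2 = [1.5 * table_1[(k - 7) % bins] for k in range(bins)]
print(align_boundaries(table_1, table_2))
```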

To  the  knowledge  of  this  author,  these  techniques  have  not  been 
implemented.  Since  totally  unknown  camera  roll  and  focal  length  change  tend 
to  be  the  exception,  rather  than  the  rule,  this  author  leaves  the  solution  to 
someone  who  has  the  problem. 

Local Scale Factor Changes

Local scale factor changes occur because the object is closer to one
camera than to the other. This is particularly noticeable in forward motion
stereo, as might be taken by a vehicle rolling along some path. If the
approximate local scale factor is known, it can be taken into account, and a
correlation function which does mapping instead of matching can be employed
to determine the area correspondence. (See Appendix B for descriptions of
matching and mapping correlation.)

The idea of finding boundaries and thus determining the relative
scale factor change is still feasible; however, it requires knowing where the
object is in both pictures. Since this might well be the information we are
trying to determine, this approach is usually not practical.

A second technique recently implemented by Quam uses a camera model.
Given a camera model and a pair of points, it is computationally rather
simple to determine the distance from each camera to the point in 3-space
corresponding to those two points (See Appendix C). For each proposed
mapping, these distances are calculated using the center points of the
proposed corresponding areas. From these distances and the focal lengths,
one calculates the effective difference in scale between the two areas so
that mapping tables can be set and the correlation evaluated.
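
The scale calculation can be illustrated under the usual pinhole assumption
that the image size of an object is proportional to focal length divided by
distance. The sketch below is not Quam's implementation; the function names
and the form of the mapping table (a dictionary from window offsets in the
first picture to scaled offsets in the second) are hypothetical.

```python
def relative_scale(focal_1, distance_1, focal_2, distance_2):
    """Scale of the second picture's area relative to the first, assuming
    image size is proportional to focal length / distance."""
    return (focal_2 / distance_2) / (focal_1 / distance_1)


def mapping_table(window_radius, scale):
    """Map each window offset in picture 1 to a scaled offset in picture 2."""
    span = range(-window_radius, window_radius + 1)
    return {(di, dj): (scale * di, scale * dj) for di in span for dj in span}


# Hypothetical case: identical lenses, object 10 units from the first
# camera and 8 units from the second.
scale = relative_scale(50.0, 10.0, 50.0, 8.0)
table = mapping_table(1, scale)
print(scale, table[(1, -1)])
```

The mapped offsets are generally fractional, so evaluating the correlation
requires choosing between interpolation and the nearest array element.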

When the object lies at the same distance from both cameras, but with
a face at a large angle to the camera baseline, scale distortion occurs
primarily in one dimension in the image--the dimension most nearly parallel
to the camera baseline. For instance, in the barn pictures the barn face is
distorted from one view to the other, but the distortion is primarily
horizontal--the direction of the camera baseline. This suggests making the
correlation window more narrow in the direction of the camera baseline in
order to reduce distortion to allow matching to take place.

Most other distortions cause the two images to contain different
projections of the object. In general, it is unlikely that more than a small
part of an object so distorted can be matched. Mapping is possible, of
course--IF one has a 3-D model of the object, knows the original 3-space
position and orientation of the object and how it changed, and has a reliable
camera model. However, if one already knows that much about the scene, there
is little point in doing matching, or any other vision work.


For a given pair of points (Ir,Jr) and (Is,Js), examine the distortion
vector (Dx,Dy) as a function of depth, r-depth = s-depth, i.e.
all of the 3-space points are in a plane perpendicular to the
first camera's principal axis.

 (Ir,Jr)    r-depth    (Is,Js)    s-depth       Dx       Dy

 -50  10      1.000    -55   5      1.000    -.127    -.565
 -50  10      1.500    -55   5      1.500    -.113    -.307
 -50  10      2.000    -55   5      2.000    -.099    -.186
 -50  10      5.000    -55   5      5.000    -.055     .017
 -50  10     10.000    -55   5     10.000    -.051     .082
 -50  10     15.000    -55   5     15.000    -.046     .103
 -50  10     20.000    -55   5     20.000    -.044     .113
 -50  10     50.000    -55   5     50.000    -.040     .132
 -50  10    100.000    -55   5    100.000    -.038     .138

For a given pair of points and r-depth, examine the distortion vector
as the s-depth varies.

 (Ir,Jr)    r-depth    (Is,Js)    s-depth       Dx       Dy

  10  10     10.000     17  17     10.050     .515    -.028
  10  10     10.000     17  17     10.050     .474    -.027
  10  10     10.000     17  17     10.000     .272    -.023
  10  10     10.000     17  17      9.900    -.138    -.015
  10  10     10.000     17  17      9.820    -.472    -.009
  10  10     10.000     17  17      9.810    -.514    -.008

For given (Ir,Jr) and r-depth = s-depth, find approximate depth at which
the maximum distortion of .5 occurs for a variety of (Is,Js).

 (Ir,Jr)    r-depth    (Is,Js)    s-depth       Dx       Dy

  10  10       .380     11  11       .380    -.024     .487
  10  10       .610     12  12       .610     .105     .492
  10  10       .820     13  13       .820     .176     .499
  10  10      1.030     14  14      1.030     .231     .496
  10  10      1.220     15  15      1.220     .281     .499
  10  10      2.240     20  20      2.240     .500     .448

Illustration D-1. A table of the distortion vectors in the barn pictures for
different object positions and orientations. The depths given are in meters
and represent the z-coordinate of that particular 3-space point.